What's wrong here??

From: Michael <michael(at)not-real.insulin-pumpers.org>
Date: Thu Apr 11 2002 - 21:37:52 GMT
It seems that almost every site I try to index somehow zaps the 
final index. Example below for http://www.sugarcharity.org/

Nothing unique about this site, it is just small.

http://www.sugarcharity.org/page3.html

contains an assortment of words that are probably NOT common and 
should appear in the index, but do not:

"letter, applicant, pump, financial, income, postal, employer", etc...

Running the index produces this file:
ls -l swish.index 
-rw-r--r--    1 spider   users       49723 Apr 11 13:16 swish.index

using
swish-e -V
SWISH-E 2.0
really 2.05

tmp.config contains
IndexFile ./swish.index
MetaNames author description datamodified
IndexReport 3
FollowSymLinks yes
UseStemming yes
PropertyNames author description datamodified
IgnoreTotalWordCountWhenRanking yes
MinWordLimit 4
WordCharacters abcdefghijklmnopqrstuvwxyz0123456789.-_'"
#IgnoreLimit 80 1000
IgnoreWords SwishDefault
IndexComments 0
NoContents .gif .xbm .au .mov .mpg .pdf .ps .jpeg .jpg
MaxDepth 4
Delay 5
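
As a sanity check on the config above, here is a minimal Python sketch that applies a simplified reading of MinWordLimit and WordCharacters to the example words from page3.html. The helper name `passes_basic_filters` is hypothetical, and this is not how swish-e itself is implemented: it ignores stemming, stopwords, and the Begin/End character rules. It only shows that these two settings alone should not exclude the words in question.

```python
# Values copied from tmp.config in this post.
WORD_CHARACTERS = set("abcdefghijklmnopqrstuvwxyz0123456789.-_'\"")
MIN_WORD_LIMIT = 4


def passes_basic_filters(word, min_len=MIN_WORD_LIMIT, chars=WORD_CHARACTERS):
    """Simplified check (an assumption, not swish-e's real logic):
    the word is at least min_len long and uses only allowed characters."""
    w = word.lower()
    return len(w) >= min_len and all(c in chars for c in w)


# Example words quoted from page3.html.
EXAMPLE_WORDS = ["letter", "applicant", "pump", "financial",
                 "income", "postal", "employer"]

rejected = [w for w in EXAMPLE_WORDS if not passes_basic_filters(w)]
print("rejected by length/character rules:", rejected)
```

Under these assumptions none of the example words is rejected, so the length and character settings by themselves do not explain why they are missing.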

command line
swish-e -i http://www.sugarcharity.org -c tmp.config -l -v 3 -S http

The output of this run is at the end of this message.

but.....
/usr/local/bin/swish-e -t HBthe -w "letter" -m 0 -f swish.index

says...

# Swish-e format 2.0
# 
# Name: (no name)
# Saved as: swish.index
# Counts: 7 words
# Indexed on: 11/04/2002 13:07:47 PDT
# Description: (no description)
# Pointer: (no pointer)
# Maintained by: (no maintainer)
# DocumentProperties: Enabled
# Stemming Applied: 1
# Soundex Applied: 0
# WordCharacters: '-.0123456789_abcdefghijklmnopqrstuvwxyz
# MinWordLimit: 4
# MaxWordLimit: 40
# BeginCharacters: "&'(0123456789abcdefghijklmnopqrstuvwxyzSO
# EndCharacters: "'),.0123456789\abcdefghijklmnopqrstuvwxyzSO
# IgnoreFirstChar: "'(
# IgnoreLastChar: "'),.;
# SWISH format 2.0
err: the index file(s) is empty

HELLO!!! What is this? Why is the index reported as empty?
This is happening on many sites that indexed successfully in the 
past but now produce an index file of the same size as above. It 
appears that something date related has broken.
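
Since the same failure shows up on many sites, a small script can flag it automatically. The sketch below is one possible approach in Python: it scans captured search output (as text) for the `# Key: value` header lines and any `err:` lines. The function name `summarize_search_output` is hypothetical, and the sample text is a shortened copy of the output shown above.

```python
def summarize_search_output(text):
    """Return (header, errors) parsed from captured search output.

    header: dict built from lines shaped like '# Key: value'
    errors: list of messages from lines starting with 'err:'
    """
    header = {}
    errors = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("err:"):
            errors.append(line[len("err:"):].strip())
        elif line.startswith("# ") and ":" in line:
            key, value = line[2:].split(":", 1)
            header[key.strip()] = value.strip()
    return header, errors


# Shortened copy of the output quoted in this post.
SAMPLE = """\
# Swish-e format 2.0
# Counts: 7 words
# Stemming Applied: 1
# MinWordLimit: 4
# SWISH format 2.0
err: the index file(s) is empty
"""

header, errors = summarize_search_output(SAMPLE)
print("Counts:", header.get("Counts"))
print("Errors:", errors)
```

Running this over the saved output for each site would list which indexes come back empty without checking each one by hand.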

results of index operation

Indexing Data Source: "HTTP-Crawler"
Indexing http://www.sugarcharity.org..
retrieving http://www.sugarcharity.org (0)...
 (35 words)
retrieving http://www.sugarcharity.org/index.htm (1)...
 (35 words)
Skipping ...<snip>
retrieving http://www.sugarcharity.org/page2.html (1)...
 (21 words)
Skipping http://www.canadianbutterfly.ca/:  Wrong method or server.
retrieving http://www.sugarcharity.org/page3.html (1)...
 (132 words)
retrieving http://www.sugarcharity.org/page4.html (1)...
 (85 words)
Skipping ... <snip>
http://www.sugarcharity.org/page5.html (1)...
 (101 words)
Skipping ... <snip>
retrieving http://www.sugarcharity.org/page6.html (1)...
 (98 words)
Removing very common words...
360 words removed.
24 words removed not in common words array:
124, amp, put, 4.0, ne, ha, tax, t4, ag, sex, 495, l7t, 2x5, ask, pat,
  zip, dai, 2, 00, p.m, moo, big, box, ad, Writing main index...
Computing hash table ... Writing header ... Writing index entries ...
Writing stopwords ... no unique words indexed. Writing file index...
Writing file list ... Writing file offsets ... Writing MetaNames ...
Writing offsets (2)... 7 files indexed. Running time: 37 seconds.
Indexing done!
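
For reference, the numbers in the log above can be tallied with a short Python sketch. It adds up the per-page word counts and checks the lengths of the 24 words listed as removed. The log excerpt and word list are copied from this post; how swish-e arrives at its own totals is not shown here, so treat the comparison with MinWordLimit as an observation, not a diagnosis.

```python
import re

# Per-page lines copied from the indexing log in this post.
LOG = """\
retrieving http://www.sugarcharity.org (0)...
 (35 words)
retrieving http://www.sugarcharity.org/index.htm (1)...
 (35 words)
retrieving http://www.sugarcharity.org/page2.html (1)...
 (21 words)
retrieving http://www.sugarcharity.org/page3.html (1)...
 (132 words)
retrieving http://www.sugarcharity.org/page4.html (1)...
 (85 words)
http://www.sugarcharity.org/page5.html (1)...
 (101 words)
retrieving http://www.sugarcharity.org/page6.html (1)...
 (98 words)
"""

# The "words removed not in common words array" list from the log.
REMOVED = ("124, amp, put, 4.0, ne, ha, tax, t4, ag, sex, 495, l7t, 2x5, "
           "ask, pat, zip, dai, 2, 00, p.m, moo, big, box, ad")

per_page = [int(n) for n in re.findall(r"\((\d+) words\)", LOG)]
removed_words = [w.strip() for w in REMOVED.split(",")]
longest_removed = max(len(w) for w in removed_words)

print("pages:", len(per_page), "total words reported:", sum(per_page))
print("removed words listed:", len(removed_words),
      "longest length:", longest_removed)
```

The seven pages report 507 words in total, and every one of the 24 listed words is shorter than the MinWordLimit of 4 set in tmp.config.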

Michael@Insulin-Pumpers.org
Received on Thu Apr 11 21:39:19 2002