These are fairly ordinary HTML files, except that the titles can run up to
900 characters. In all, they contain approximately 105,000 words.
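For anyone who wants to check figures like these on their own file set, here is a minimal Python sketch. It is not part of swish, and the names `TitleGrabber` and `summarize` are made up for illustration. It assumes the files match `*.htm` in a single directory and counts whitespace-separated tokens in the raw file (markup included), which is roughly what `wc -w` reports.

```python
from html.parser import HTMLParser
from pathlib import Path


class TitleGrabber(HTMLParser):
    """Hypothetical helper: collects the text of the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def summarize(directory, pattern="*.htm"):
    """Return file count, raw word count, and longest title length."""
    files = 0
    words = 0
    longest_title = 0
    for path in sorted(Path(directory).glob(pattern)):
        # latin-1 never fails to decode, which suits old HTML of unknown encoding
        text = path.read_text(encoding="latin-1")
        grabber = TitleGrabber()
        grabber.feed(text)
        grabber.close()  # flush any buffered text if </title> is missing
        files += 1
        words += len(text.split())
        longest_title = max(longest_title, len(grabber.title.strip()))
    return {"files": files, "words": words, "longest_title": longest_title}
```

Calling `summarize("/path/to/docs")` returns a dictionary with the keys `files`, `words`, and `longest_title`; the path here is only a placeholder.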
At 08:04 AM 8/31/00, you wrote:
>At 07:36 AM 08/31/00 -0700, Frank Heasley wrote:
> >OS: Red Hat Linux v 6.1
> >PII 233
> >128 MB RAM
> >3,000 files, 1-2k each
> >v 1.3.0: 8 minutes, 97.6% (no Meta indexing)
> >v 1.3.2: 74 minutes, 99.7% of RAM (with Meta indexing)
> >v 2.0.1: 77 minutes, 99.5% of RAM (with Meta indexing)
>Wow, what do you have in those files? I think something is broken. What
>else is running? Do you have any swap space?
>I'm running SuSE Linux on a P550 with 128M. I have about twice your number
>of files (6414), all about 1-2k each with quite a few meta tags, and it
>indexes in 33 seconds.
>MetaNames SUBJECT TITLE DESCRIPTION URLS IDENTIFIER KEYWORDS CREATOR
>CATEGORY AUTHOR PUBLISHER
>PropertyNames CATEGORY SUBJECT
> > wc -w *.htm | grep total
>1239639 total words
> > ll | wc -l
> 6434 total files
> > ./swish -c swish_no_stem.conf
>Indexing Data Source: "File-System"
>Removing very common words...
>8 words removed.
>0 words removed not in common words array:
>Writing main index...
>Computing hash table ...
>Writing header ...
>Writing index entries ...
>Writing stopwords ...
>28016 unique words indexed.
>Writing file index...
>Writing file list ...
>Writing file offsets ...
>Writing MetaNames ...
>Writing offsets (2)...
>6414 files indexed.
>Running time: 38 seconds.
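To put the two reports side by side, here is a rough back-of-the-envelope calculation in Python using only the figures quoted in this thread. It assumes the two word counts are comparable, which they may not be: the 1,239,639 figure comes from `wc -w` on the raw files, while the 105,000 figure is an estimate whose counting method is not stated. The helper name `words_per_second` is just for illustration.

```python
def words_per_second(words, seconds):
    """Indexing throughput in words per second."""
    return words / seconds


# Figures as quoted in the thread
slow = words_per_second(105_000, 74 * 60)  # v 1.3.2: ~105,000 words in 74 minutes
fast = words_per_second(1_239_639, 38)     # quoted log: 1,239,639 words in 38 seconds

print(f"slow run: {slow:.1f} words/s")
print(f"fast run: {fast:.0f} words/s")
print(f"ratio:    {fast / slow:.0f}x")
```

Even allowing for the slower CPU and the different counting methods, a gap of this size points to something other than raw processing speed, such as swapping.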
Received on Thu Aug 31 15:32:48 2000