Re: New version: swish-e-1.3.2-PHRASEo.tar.gz

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Jun 15 2000 - 16:36:08 GMT

Hi Jose,

Indexing speed is amazing!

I didn't hear back from you about making swish split up query words on
WordCharacters.  What was your thinking about this issue?

Again, it's my feeling that people tend to break up words on white space,
but that's not how swish breaks up words.  So if a source document contained:

             food-lover

and swish's WordCharacters didn't contain the dash, it would be indexed as
two separate words.

So if someone queried like this (say, by cutting-n-pasting from the source
text):

        swish -w 'keywords=food-lover'

then swish should split it on WordCharacters and convert it to:

        swish -w 'keywords=food lover'

And that would be done to the query once per index file searched, and
before assigning word positions to the words (so before any phrasing was
taken into consideration).
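
As a rough sketch of the splitting described above (Python for brevity, not
swish-e's actual C code): the function names and the sample WordCharacters
value are made up for illustration, and a real implementation would read
WordCharacters from each index file's header before splitting.

```python
import string

# Hypothetical WordCharacters setting: letters and digits only, no dash.
WORD_CHARS = string.ascii_letters + string.digits


def split_query(query, word_characters):
    """Split a query string into words, breaking on any character
    that is not listed in word_characters."""
    allowed = set(word_characters)
    words = []
    current = []
    for ch in query:
        if ch in allowed:
            current.append(ch)
        elif current:
            # A non-word character ends the current word.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


def assign_positions(words):
    """Number the words from 1, only after splitting, so that phrase
    matching sees the same positions the indexer would have produced."""
    return list(enumerate(words, start=1))


words = split_query("food-lover", WORD_CHARS)
print(words)                    # ['food', 'lover']
print(assign_positions(words))  # [(1, 'food'), (2, 'lover')]

# If the dash is a word character, the term stays whole.
print(split_query("food-lover", WORD_CHARS + "-"))  # ['food-lover']
```

Because the split depends on WordCharacters, it would have to be repeated
for each index file searched, since two indexes may have been built with
different settings.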

Otherwise, searching for text copied exactly from the source document would
fail to find results.  I would find that confusing as a searcher.

Also, I probably missed something in the discussion, but why does the old
version pull out automatic stop words while the new version doesn't, given
the same config file?


> ~/phrase/swish-e-1.3.2-PHRASEo/src/swish-e -c swish.cfg
Indexing Data Source: "File-System"
Indexing /docs..
Removing very common words...
no words removed.
Writing main index...
Computing hash table ...
Writing header ...
Writing index entries ...
Writing stopwords ...
26765 unique words indexed.
Writing file index...
Writing file list ...
Writing file offsets ...
Writing MetaNames ...
Writing offsets (2)...     << ---- what does the (2) mean?
6396 files indexed.
Running time: 36 seconds.  <<--- Wow!
Indexing done!

** "Old" SRE version of swish-e **

> ~/swish/swish-e -c swish.cfg
Indexing Data Source: "File-System"

Removing very common words... 5 words removed.
5 words removed not in common words array:
and, http, of, to, www,        <<<--- these weren't pulled above

Writing main index... 29367 unique words indexed.
Writing file index... 6396 files indexed.
Running time: 8 minutes, 3 seconds.   <<==== SLOW!
Indexing done!

Thanks again, and thanks for the great job on this update!

Bill Moseley
mailto:moseley@hank.org
Received on Thu Jun 15 12:38:59 2000