Re: index size versus searching for quoted string tradeoffs

From: Peter Karman <karman(at)>
Date: Thu Jun 17 2004 - 12:43:02 GMT
I believe that version 2.4.x *does* remove stopwords (IgnoreWords) from
queries, as well as ignoring them during indexing.

I just did this test:

karman@topaz08 299% cat test.txt
hello to all the world
karman@topaz08 300% cat config
IgnoreWords File: ./ignore
DefaultContents TXT*

and then indexed:
karman@topaz08 295% swish-e -i test.txt -c config
Indexing Data Source: "File-System"
Indexing "test.txt"
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 2 words alphabetically
Writing header ...
Writing index entries ...
   Writing word text: Complete
   Writing word hash: Complete
   Writing word data: Complete
2 unique words indexed.
4 properties sorted.
1 file indexed.  23 total bytes.  2 total words.
Elapsed time: 00:00:01 CPU time: 00:00:00
Indexing done!

and then searched:

karman@topaz08 296% swish-e -w 'hello world'
# SWISH format: 2.4.1
# Search words: hello world
# Removed stopwords:
# Number of hits: 1
# Search time: 0.224 seconds
# Run time: 0.266 seconds
1000 test.txt "test.txt" 23
karman@topaz08 297% swish-e -w 'hello to all the world'
# SWISH format: 2.4.1
# Search words: hello to all the world
# Removed stopwords: to all the
# Number of hits: 1
# Search time: 0.223 seconds
# Run time: 0.264 seconds
1000 test.txt "test.txt" 23

you'll note that both queries found the document.
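
To make the behaviour in that transcript concrete, here is a toy sketch in Python. It is not swish-e's actual code. It assumes the ./ignore file contains at least "to", "all" and "the" (inferred from the "Removed stopwords" line above), and it assumes a multi-word query is treated as an AND of its terms. The function names are hypothetical. It models plain word queries like the ones in the test, not quoted phrase searches.

```python
# Toy model of stopword removal at both index time and query time.
# Assumption: these are the words listed in ./ignore (inferred from the
# "Removed stopwords: to all the" line; the real file is not shown).
STOPWORDS = {"to", "all", "the"}


def tokenize(text):
    """Lowercase and split on whitespace (a simplification)."""
    return text.lower().split()


def strip_stopwords(words):
    """Return (kept, removed), preserving the original word order."""
    kept = [w for w in words if w not in STOPWORDS]
    removed = [w for w in words if w in STOPWORDS]
    return kept, removed


def build_index(docs):
    """Map each non-stopword to the set of documents containing it."""
    index = {}
    for name, text in docs.items():
        kept, _ = strip_stopwords(tokenize(text))
        for word in kept:
            index.setdefault(word, set()).add(name)
    return index


def search(index, query):
    """Drop stopwords from the query, then AND the remaining terms."""
    kept, removed = strip_stopwords(tokenize(query))
    if not kept:
        return set(), removed
    hits = set.intersection(*(index.get(w, set()) for w in kept))
    return hits, removed


docs = {"test.txt": "hello to all the world"}
index = build_index(docs)

print(sorted(index))                            # ['hello', 'world']
print(search(index, "hello world"))             # ({'test.txt'}, [])
print(search(index, "hello to all the world"))  # ({'test.txt'}, ['to', 'all', 'the'])
```

Because the same word list is applied on both sides, the longer query reduces to the same two terms that were indexed, which is why both queries return the document.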

Am I totally misunderstanding your question?


Bill Schell wrote on 06/16/2004 11:24 AM:
> I just thoroughly confused myself by searching for a phrase ("Text of
> Report") that I knew was in the documents I had just indexed. I couldn't
> find it! After some head scratching I realized that the word 'of' is in
> the file cited in the IgnoreWords configuration directive.
> If this confused me, it will *really* confuse my users, who know nothing
> about any IgnoreWords file. They would have to figure out that they
> should enter "Text Report", although that is not what is in the document.
> The only immediate fix I can think of for this is to get rid of the
> IgnoreWords directive, which will make my indices bigger and slower to
> search.
> I'm wondering if a future version of swish-e should remove words cited in
> the IgnoreWords file from all search terms? Or is the performance loss
> from removing the IgnoreWords directive for a reasonable set of common
> English words not worth worrying about?
> Bill

Peter Karman - Software Publications Programmer - Cray Inc
phone: 651-605-9009 -
Received on Thu Jun 17 12:43:04 2004