RE: index size versus searching for quoted string trade

From: Peter Karman <karman(at)not-real.cray.com>
Date: Thu Jun 17 2004 - 13:14:41 GMT
Andrew Payne wrote on 06/16/2004 03:27 PM:
> I just had a similar problem. I've got a minimum word length of 3 characters
> defined to help keep my indexes manageable, but this causes the search to
> fail when one of the terms is less than 3 characters long. 

Have you done any testing to see whether lowering the minimum word length
makes indexing or searching measurably slower? The reason I ask is that I
did a test on this last year (I think it was version 2.2.x, but I can't
remember) and it made little difference in either searching or indexing
whether my minimum word length was 1 or 3. I think my doc set was about
30000 HTML docs.



> It might be cool
> to store in the index the rules that exclude content from being indexed, and
> apply those rules to the search terms before searching. I've tried using -c
> to include the config file when searching, (hoping that the minimum length
> rule would be applied to the search terms as well) but the config file
> doesn't seem to apply, or at least that part doesn't, when searching. I've
> written a filter into my search page, but it's not at all portable. 
> 

I see some wisdom in storing the word length and ignoring query words 
based on that. But other configs like WordCharacters, etc.? It seems 
like that kind of parsing would slow down the query parser 
significantly. You'd almost have to apply the html parser to the query, 
after all the and's, or's, not's, ()'s, ""'s, etc. had been eval'd.

I believe Bill has asked for help re-writing the query parser, but no 
one has been brave/foolish enough to step forward. This might be a 
useful feature to add when that re-write happens.
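
To make the idea concrete, here is a minimal sketch in Python (not swish-e code) of dropping too-short words from a query before searching. It assumes the index's minimum word length is available at search time. The function name drop_short_words, the operator set, and the tokenizing regex are all hypothetical, and WordCharacters-style rules are ignored entirely.

```python
import re

# Hypothetical operator set: these must survive outside a phrase even
# though "or" is shorter than a minimum word length of 3.
OPERATORS = {"and", "or", "not"}
PUNCT = {"(", ")", '"'}

def drop_short_words(query, min_len=3):
    """Return query tokens with words shorter than min_len removed.

    Quotes and parentheses are always kept. Operators are kept only
    outside a quoted phrase; inside a phrase they are ordinary words.
    """
    tokens = re.findall(r'[()"]|[^\s()"]+', query)
    kept = []
    in_phrase = False
    for tok in tokens:
        if tok == '"':
            in_phrase = not in_phrase
            kept.append(tok)
        elif tok in PUNCT:
            kept.append(tok)
        elif not in_phrase and tok.lower() in OPERATORS:
            kept.append(tok)
        elif len(tok) >= min_len:
            kept.append(tok)
    return kept

print(drop_short_words('"hello to all the world"'))
```

Even this toy version has to know about quoting and operators, and it can still leave a dangling operator behind (for example when the word before an "or" is dropped), which is the sort of complication that argues for doing this inside the query parser.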

Just for the archives, I see that swish_words.c has commented out the 
routine that might check for min length:

      /* limit by stopwords, min/max length, max number of digits, ... */
      /* ------- processed elsewhere for search ---------
         if (!isokword(sw, self->word, indexf))
             continue;
   ...
      */

> As a related question, what does a minimum word length do to the phrase
> search capability? Does the phrase search just work by word adjacency? If
> so, applying the same rules to the search terms would still allow phrase
> searches to match (while adding a little ambiguity.)

I believe that if a word doesn't meet the minimum length, it is skipped 
and the word position is *not* bumped. Thus if a word is too short, it 
isn't indexed, and words on either side of it are considered 'adjacent'.

Unlike IgnoreWords, however, the skipped word (as you noted) is not 
automatically removed from the query, resulting in confusing no-hits.

My test:

karman@topaz08 322% cat test.txt
hello to all the world
karman@topaz08 323% cat config
#IgnoreWords File: ./ignore
DefaultContents TXT*
MinWordLimit 3


karman@topaz08 318% swish-e -i test.txt -c config -v 3
Parsing config file 'config'
Indexing Data Source: "File-System"
Indexing "test.txt"

Checking file "test.txt"...
   test.txt - Using TXT2 parser -  (4 words)

Removing very common words...
no words removed.

karman@topaz08 319% swish-e -w '"hello world"'
# SWISH format: 2.4.1
# Search words: "hello world"
# Removed stopwords:
err: no results
.
karman@topaz08 320% swish-e -w '"hello to all the world"'
# SWISH format: 2.4.1
# Search words: "hello to all the world"
# Removed stopwords:
err: no results
.
karman@topaz08 321% swish-e -w '"hello all the world"'
# SWISH format: 2.4.1
# Search words: "hello all the world"
# Removed stopwords:
# Number of hits: 1
# Search time: 0.196 seconds
# Run time: 0.237 seconds
1000 test.txt "test.txt" 23
.
karman@topaz08 324% swish-e -w '"hello all"'
# SWISH format: 2.4.1
# Search words: "hello all"
# Removed stopwords:
# Number of hits: 1
# Search time: 0.186 seconds
# Run time: 0.227 seconds
1000 test.txt "test.txt" 23
.
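
The four results above are consistent with a toy model in which skipped words do not advance the word position. The Python sketch below is only an illustration of that belief, not swish-e's actual implementation; build_index and phrase_match are hypothetical names, and words are split on whitespace only.

```python
def build_index(text, min_len=3):
    """Map each kept word to its positions.

    Words shorter than min_len are not indexed and do not bump the
    position counter, so their neighbours end up adjacent.
    """
    index = {}
    pos = 0
    for word in text.lower().split():
        if len(word) < min_len:
            continue  # too short: skipped, position not bumped
        pos += 1
        index.setdefault(word, []).append(pos)
    return index

def phrase_match(index, phrase):
    """True if the phrase words occur at consecutive positions."""
    words = phrase.lower().split()
    if not words:
        return False
    for start in index.get(words[0], []):
        if all(start + i in index.get(w, []) for i, w in enumerate(words)):
            return True
    return False

index = build_index("hello to all the world")
for q in ["hello world", "hello to all the world",
          "hello all the world", "hello all"]:
    print(q, "->", phrase_match(index, q))
```

In this model "hello to all the world" fails only because "to" is never in the index, so removing it from the query first would turn that no-hit into a match.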



> 
> -Andy
> 
> -----Original Message-----
> From: Bill Schell [mailto:friedfish@optonline.net]
> Sent: Wednesday, June 16, 2004 09:24
> To: Multiple recipients of list
> Subject: [SWISH-E] index size versus searching for quoted string
> tradeoffs (possible
> 
> 
> I just thoroughly confused myself by searching for a phrase ("Text of
> Report") that I knew was in the documents I had just indexed. I couldn't
> find it! After some head scratching I realized that the word 'of' is in
> the file cited in the IgnoreWords configuration directive.
>
> If this confused me, it will *really* confuse my users, who know nothing
> about any IgnoreWords file. They would have to figure out that they should
> enter "Text Report", although that is not what is in the document. The
> only immediate fix I can think of for this is to get rid of the
> IgnoreWords directive, which will make my indices bigger and slower to
> search.
>
> I'm wondering if a future version of swish-e should remove words cited in
> the IgnoreWords file from all search terms? Or is the performance loss on
> removing the IgnoreWords directive for a reasonable set of common English
> words not worth worrying about?
> 
> Bill

-- 
Peter Karman - Software Publications Programmer - Cray Inc
phone: 651-605-9009 - mailto:karman@cray.com
Received on Thu Jun 17 13:14:44 2004