Re: New version: swish-e-1.3.2-PHRASEo.tar.gz

From: Jose Manuel Ruiz <jmruiz(at)not-real.boe.es>
Date: Fri Jun 16 2000 - 07:44:35 GMT
Hi Bill,

> I didn't hear back from you about making swish split up query words on
> WordCharacters.  What was your thinking about this issue?
> 
> Again, it's my feeling that people tend to break up words on white space,
> but that's not how swish breaks up words.  So if a source document contained:
> 
>              food-lover
> 
> and swish didn't contain the dash in WordCharacters it would be indexed as
> two separate words.
> 
> So if someone queried like this (say by cutting-n-pasting from the source
> text)
> 
>         swish -w 'keywords=food-lover'
> 
> Then swish should split it on WordCharacters and convert it to:
> 
>         swish -w 'keywords=food lover'
> 
> And that would be done to the query once per index file searched, and
> before assigning word positions to the words (so before any phrasing was
> taken into consideration).
> 
> Otherwise, searching for text copied exactly from the source document would
> fail to find results.  I would find that confusing as a searcher.
> 

I am on it, but I want to do it the best possible way. The fix has to
deal with several index files, each possibly with different
WordCharacters. The same problem also appeared when merging index files,
which is now fixed. The parser also needs fixing: up to now it splits the
search query and then opens the index file. This order has to be changed.
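
As a rough sketch of the per-index splitting described above (this is not
the swish-e code; the function names, the index names and the character
sets are all made up for illustration), the query is split separately for
each index, using that index's own WordCharacters:

```python
def split_on_wordchars(term, wordchars):
    """Split one query term into runs of characters found in wordchars."""
    words, current = [], []
    for ch in term.lower():
        if ch in wordchars:
            current.append(ch)
        elif current:
            # A non-word character ends the current word.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


def tokenize_query_per_index(query, index_wordchars):
    """Return {index_name: [words]}, splitting the query once per index.

    index_wordchars maps a (hypothetical) index name to the set of
    characters that index was built with.
    """
    result = {}
    for name, wordchars in index_wordchars.items():
        words = []
        for term in query.split():
            words.extend(split_on_wordchars(term, wordchars))
        result[name] = words
    return result


LETTERS = set("abcdefghijklmnopqrstuvwxyz")
# Assumption: one index was built without '-' in WordCharacters, one with it.
indexes = {"a.index": LETTERS, "b.index": LETTERS | {"-"}}
per_index = tokenize_query_per_index("food-lover", indexes)
print(per_index)
```

Here the same query becomes two words for the first index and stays one
word for the second, which is why the splitting cannot happen before the
index file has been opened.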

> Also, I probably missed something in the discussion, but why does the old
> version pull out automatic stop words and the new version doesn't for the
> same config file?
> 
> Writing offsets (2)...     << ---- what does the (2) mean?

For faster searching I implemented a hash approach for direct
(non-wildcard) searches. The old version has to read all of the words with
the same first character until the word you are looking for is found.
The hash approach needs a second write to the index file, which is what
the (2) indicates.
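
To illustrate the difference between the two lookups, here is a minimal
in-memory sketch. It is not the swish-e index format: the bucket count,
the use of CRC32 as the hash, and all names are assumptions made for the
example. The first pass lays out the word records and notes their offsets;
the second pass builds the hash table from those offsets, which is the
analogue of the second write.

```python
import zlib

NUM_BUCKETS = 8  # hypothetical table size, chosen only for the example


def bucket_of(word):
    """Map a word to a bucket number (CRC32 is an arbitrary choice here)."""
    return zlib.crc32(word.encode("utf-8")) % NUM_BUCKETS


def build_index(words):
    """Pass 1 records each word's offset; pass 2 builds the hash buckets."""
    records = sorted(set(words))
    offsets = {word: off for off, word in enumerate(records)}  # pass 1
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for word, off in offsets.items():                          # pass 2
        buckets[bucket_of(word)].append(off)
    return records, buckets


def lookup_linear(records, word):
    """Old style: compare against every word sharing the first character."""
    steps = 0
    for off, rec in enumerate(records):
        if rec[:1] != word[:1]:
            continue
        steps += 1
        if rec == word:
            return off, steps
    return None, steps


def lookup_hashed(records, buckets, word):
    """Hash style: compare only against the words in one bucket."""
    steps = 0
    for off in buckets[bucket_of(word)]:
        steps += 1
        if records[off] == word:
            return off, steps
    return None, steps


records, buckets = build_index(
    ["food", "fork", "form", "fox", "lover", "love", "swish"])
print(lookup_linear(records, "fox"))
print(lookup_hashed(records, buckets, "fox"))
```

Both functions return the offset found and the number of comparisons made.
A wildcard search cannot use the hash table, since the full word is needed
to compute the bucket, so it would still have to scan.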

> 6396 files indexed.
> Running time: 36 seconds.  <<--- Wow!
> Indexing done!
> 
> ** "Old" SRE version of swish-e **
> 
> > ~/swish/swish-e -c swish.cfg
> Indexing Data Source: "File-System"
> 
> Removing very common words... 5 words removed.
> 5 words removed not in common words array:
> and, http, of, to, www,        <<<--- these weren't pulled above

Although I tested the IgnoreLimit option, you have probably found a bug.
Can you send me your config file?

Jose Ruiz

jmruiz@boe.es
Received on Fri Jun 16 03:50:41 2000