Re: Win32 PHRASE search

From: Jose Manuel Ruiz <jmruiz(at)not-real.boe.es>
Date: Tue Apr 25 2000 - 16:09:26 GMT
Hi,

Bill Moseley wrote:
> 
> At 10:43 AM 04/24/00 -0700, Jose Manuel Ruiz wrote:
> >More about stop words...
> >
> >In config.h you can find the following line:
> >
> >#define IGNORE_STOPWORDS_IN_QUERY 1
> 
> Oh, I wonder if there isn't a problem with the SRE's change in that routine
> in search.c?  I'm getting a segfault:
> 
> For example, a document that contains this:
> 
>    "...with a searchable database of over 5,000 recipes..."
> 
> Searching for (non-phrase search)
> 
> > ../swish-e -w 'keywords=(database of over)'

If you have not changed PHRASE_DELIMITER_CHAR, you
must use:

./swish-e -w 'keywords="database of over"' 

The modification to the parser is very simple. It looks
for the delimiter char and changes the search to:

keywords=(database precd of precd over)

Read it as: database precedes of precedes over
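
To illustrate the rewrite, here is a minimal sketch in C. It is not the
code from search.c: the function name phrase_to_precd and its signature
are hypothetical, and it assumes the phrase words are separated by
single or multiple blanks.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not taken from search.c: rewrites the words of a
 * phrase into "(w1 precd w2 precd ...)". Returns 0 on success, -1 if the
 * output buffer is too small. */
int phrase_to_precd(const char *phrase, char *out, size_t outsz)
{
    size_t used = 0;
    int nwords = 0;
    const char *p = phrase;

    if (outsz < 3)
        return -1;                      /* need room for "()" and NUL */
    out[used++] = '(';

    while (*p) {
        while (*p == ' ')
            p++;                        /* skip blanks between words */
        if (!*p)
            break;
        size_t len = strcspn(p, " ");   /* length of the current word */
        const char *sep = nwords ? " precd " : "";
        size_t seplen = strlen(sep);
        if (used + seplen + len + 2 > outsz)
            return -1;                  /* keep room for ")" and NUL */
        memcpy(out + used, sep, seplen);
        used += seplen;
        memcpy(out + used, p, len);
        used += len;
        p += len;
        nwords++;
    }
    out[used++] = ')';
    out[used] = '\0';
    return 0;
}
```

Called with "database of over", this fills the buffer with
"(database precd of precd over)".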

I do not know if this is the problem but, at the least, a
parse error could be displayed.
The search is resolved like an AND, but taking word positions into account.
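
A rough sketch of what "an AND that considers positions" could look
like for two words in the same document. This is only an illustration:
the function name precedes is hypothetical, and it assumes that each
word has a sorted list of positions and that "precd" means "immediately
before".

```c
#include <stddef.h>

/* Hypothetical sketch: returns 1 if some position in b immediately
 * follows a position in a (b[j] == a[i] + 1), otherwise 0.
 * Both arrays must be sorted in ascending order. */
int precedes(const int *a, size_t na, const int *b, size_t nb)
{
    size_t i = 0, j = 0;

    while (i < na && j < nb) {
        if (b[j] == a[i] + 1)
            return 1;       /* b occurs right after a: positions match */
        if (b[j] <= a[i])
            j++;            /* b is too early, try its next position */
        else
            i++;            /* b is too far ahead, advance a */
    }
    return 0;
}
```

A plain AND would only check that both lists are non-empty. For a
three-word phrase the same check would have to be chained, so that the
position matched for the second word is the one tested against the third.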

Anyway, I will check it.

> >So, I am wondering if IGNORE_STOPWORDS_IN_QUERY has any sense now.
> >It always has to be enabled!!
> 
> Exactly.  I can't see any reason that should be an option.  I wonder if
> that was added as a step to fix the broken search.c logic.
> 

Perhaps I will remove it in the next version.

> >> Plus, I really think that swish should parse text on searching exactly like
> >> it does on indexing.  Otherwise, it is very confusing as you can't search
> >> for text cut directly from the source document and expect it to work.  That
> >> means the wordchars, ignore first and last, and other settings would need
> >> to be saved in the index file (just like the Use Stemming: setting).
> >>
> >
> >Yes, it should work that way. But this can be a major change. Let me
> >look at the code... There are other things that may be also included in the
> >index file.
> 
> I had that working at one point -- well kind of working -- but my C skills
> are not good enough to really get it right.  But it was a hack.
> 
> The problems I had were I had to read the swish.conf file on every search
> to get Wordcharacters and other settings used to determine what defines a
> "swish" word.  The solution is to put those settings in the index file header.
>
I agree
 
> The second problem I had was with multiple indexes.  I needed to rewrite
> the logic so the parsing of query words was done on a per-index file basis
> instead of just at the start of search.c.  This is because different index
> files could have different settings used to define "swish" words.
>
You are right
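
A minimal sketch of per-index word parsing, assuming each index header
carries its own set of word characters. The struct index_settings, its
field and the function next_word are hypothetical names, not taken from
the swish-e sources.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical per-index settings; the field name is illustrative. */
struct index_settings {
    const char *wordchars;   /* characters allowed inside a word */
};

/* Copies the next word of `text`, as defined by this index's wordchars,
 * into `out` (truncating if needed) and returns a pointer just past the
 * word, or NULL when no word is left. */
const char *next_word(const struct index_settings *s, const char *text,
                      char *out, size_t outsz)
{
    text += strcspn(text, s->wordchars);        /* skip non-word chars */
    size_t len = strspn(text, s->wordchars);    /* length of the word */

    if (len == 0 || outsz == 0)
        return NULL;
    size_t n = len < outsz - 1 ? len : outsz - 1;
    memcpy(out, text, n);
    out[n] = '\0';
    return text + len;
}
```

With this, the same query text "e-mail" gives the words "e" and "mail"
for an index whose wordchars are only a-z, but the single word "e-mail"
for an index that also lists "-", which is why the parsing would have to
be repeated for each index file.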

> Frankly, the entire search.c parsing always has bugged me.  It's full of
> hacks now that look for wild cards, or make exceptions if a meta tag name
> is found.  Seems like the query needs to be somehow parsed into a better
> syntax tree, but that's way above my skills.
> 

You are right once again. In my current code (not released yet) I have
removed expandstar and getmatchword and added the "wild card"
functionality to getfileinfo, because it is faster and clearer. Now, when
you search for "a*" you can even find the word "and"!!
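
For illustration only, a trailing-star match can be sketched as a prefix
comparison. This is not the code in getfileinfo; the function name
wildcard_match is hypothetical and it only handles a "*" at the end of
the pattern.

```c
#include <string.h>

/* Hypothetical helper: returns 1 if `word` matches `pattern`, where a
 * trailing '*' in the pattern means "any word starting with this
 * prefix". Without a trailing '*' the match must be exact. */
int wildcard_match(const char *pattern, const char *word)
{
    size_t plen = strlen(pattern);

    if (plen > 0 && pattern[plen - 1] == '*')
        return strncmp(pattern, word, plen - 1) == 0;  /* prefix only */
    return strcmp(pattern, word) == 0;
}
```

With this sketch, wildcard_match("a*", "and") returns 1 and
wildcard_match("a*", "band") returns 0.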

> But I need some ideas on how to solve this problem:
> 
> Say I have three meta fields: "title", "description", and "subject".
> 
> I concatenate the three into one field "keywords".  This means I can use
> swish to search any single field, or, by using "keywords" I can search all
> fields at once (as in my gdb example above).  But that has the problem that
> a phrase can span meta fields when searching "keywords".
> 
> One ugly solution would be for me to add some non-word when concatenating
> into one field so phrase would never span fields.
> 
> Or, I could change my queries to look like this:
> 
>   -w 'title=(database of over) or description=(database of over)
>          or subject=(database of over)'
> 
> But that ends up being three searches and a bit slower especially if the
> query is complex (e.g. with wild cards).

The next version will boost performance on wild cards.

> 
> I wonder how hard it would be to expand the query syntax so I could say:
> 
>   -w 'title,description,subject=(database of over)'
> 
> so swish would only have to read the index one time, yet check for the
> words or phrase within each meta tagged field.
> 
> Any ideas?
> 

A lot of work to do on the parser!!
I think the parser has to be partially rewritten.
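
As a starting point, the proposed "title,description,subject=(...)"
form could first be split into a field list and a query. The sketch
below is only that first step. The syntax is the proposal from this
thread, not something swish-e accepts today, and split_field_list and
MAX_FIELDS are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

#define MAX_FIELDS 8   /* arbitrary limit for this sketch */

/* Splits "f1,f2,...=(query)" in place. Field names are stored in
 * fields[] and *query is set to the text after the first '='.
 * Returns the number of fields, or -1 if there is no '=', a field name
 * is empty, or there are more than MAX_FIELDS fields.
 * Note: modifies `spec`. */
int split_field_list(char *spec, char *fields[MAX_FIELDS], char **query)
{
    char *eq = strchr(spec, '=');
    int n = 0;

    if (eq == NULL || eq == spec)
        return -1;
    *eq = '\0';                 /* terminate the field list */
    *query = eq + 1;

    for (char *f = spec; f != NULL; ) {
        char *comma = strchr(f, ',');
        if (comma)
            *comma = '\0';      /* terminate this field name */
        if (*f == '\0' || n == MAX_FIELDS)
            return -1;
        fields[n++] = f;
        f = comma ? comma + 1 : NULL;
    }
    return n;
}
```

The query part would then be parsed once and checked against each of the
listed fields while reading the index a single time.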

I will be out for three days and cannot attend to this list. Sorry for
the inconvenience.

Have a nice day
-- 

Jose Manuel Ruiz Ramos

jmruiz@boe.es

Jefe de Area Informatica
Boletin Oficial del Estado
Manoteras 54
Madrid 28050
Received on Tue Apr 25 12:11:48 2000