Bill Moseley wrote:
> At 10:43 AM 04/24/00 -0700, Jose Manuel Ruiz wrote:
> >More about stop words...
> >In config.h you can find the following line:
> >#define IGNORE_STOPWORDS_IN_QUERY 1
> Oh, I wonder if there isn't a problem with the SRE's change in that routine
> in search.c? I'm getting a segfault:
> For example, a document that contains this:
> "...with a searchable database of over 5,000 recipes..."
> Searching for (non-phrase search)
> > ../swish-e -w 'keywords=(database of over)'
If you have not changed PHRASE_DELIMITER_CHAR, you can run the phrase search with:
./swish-e -w 'keywords="database of over"'
The modification to the parser is very simple. It looks
for the delimiter char and changes the search to:
keywords=(database precd of precd over)
Read it as: database precedes of precedes over
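A minimal sketch of that rewrite, for illustration only. It is not the actual swish-e parser code: the function name phrase_to_precd and the buffer handling are assumptions, and it only covers the step of joining the words found between the delimiters with "precd".

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from search.c): joins the blank-separated
 * words of a phrase with the "precd" operator.
 * Returns 0 on success, -1 if the output buffer is too small. */
static int phrase_to_precd(const char *phrase, char *out, size_t outsz)
{
    size_t used = 0;
    int first = 1;
    const char *p = phrase;

    if (outsz == 0)
        return -1;
    out[0] = '\0';

    while (*p) {
        size_t len;
        int n;

        p += strspn(p, " \t");      /* skip blanks before the word */
        len = strcspn(p, " \t");    /* length of the next word */
        if (len == 0)
            break;                  /* only trailing blanks were left */

        /* Append the word, preceded by " precd " unless it is the first. */
        n = snprintf(out + used, outsz - used, "%s%.*s",
                     first ? "" : " precd ", (int)len, p);
        if (n < 0 || (size_t)n >= outsz - used)
            return -1;              /* output would be truncated */
        used += (size_t)n;
        first = 0;
        p += len;
    }
    return 0;
}
```

Called with the phrase "database of over", this fills the buffer with "database precd of precd over".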
I do not know if this is the problem but, at least, a
parse error may be displayed.
The search is resolved like an AND, but taking word positions into account.
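A rough sketch of what "an AND that considers positions" could look like. The assumptions here are mine: each word carries a strictly ascending array of positions within one document, and "precedes" means the second word sits at the very next position. The function name precd_match is hypothetical and is not taken from search.c.

```c
#include <stddef.h>

/* Hypothetical positional AND: a[] and b[] hold the positions of two
 * words in the same document, in strictly ascending order.
 * Writes to out[] every position of the second word that directly
 * follows a position of the first word, and returns how many were
 * written. out[] needs room for at least min(na, nb) entries. */
static size_t precd_match(const int *a, size_t na,
                          const int *b, size_t nb, int *out)
{
    size_t i = 0, j = 0, n = 0;

    while (i < na && j < nb) {
        if (b[j] == a[i] + 1) {     /* second word right after first */
            out[n++] = b[j];
            i++;
            j++;
        } else if (b[j] <= a[i]) {  /* b[j] is too early, try next b */
            j++;
        } else {                    /* nothing follows a[i], try next a */
            i++;
        }
    }
    return n;
}
```

Because the result is again a list of positions (those of the second word), a longer phrase can be checked by feeding the result into the next call together with the positions of the third word.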
Anyway, I will check it.
> >So, I am wondering if IGNORE_STOPWORDS_IN_QUERY makes any sense now.
> >It always has to be enabled!!
> Exactly. I can't see any reason that should be an option. I wonder if
> that was added as a step to fix the broken search.c logic.
Perhaps I will remove it in the next version.
> >> Plus, I really think that swish should parse text on searching exactly like
> >> it does on indexing. Otherwise, it is very confusing as you can't search
> >> for text cut directly from the source document and expect it to work. That
> >> means the wordchars, ignore first and last, and other settings would need
> >> to be saved in the index file (just like the Use Stemming: setting).
> >Yes, it should work that way. But this can be a major change. Let me
> >look at the code... There are other things that may be also included in the
> >index file.
> I had that working at one point -- well kind of working -- but my C skills
> are not good enough to really get it right. But it was a hack.
> The problems I had were I had to read the swish.conf file on every search
> to get Wordcharacters and other settings used to determine what defines a
> "swish" word. The solution is to put those settings in the index file header.
> The second problem I had was with multiple indexes. I needed to rewrite
> the logic so the parsing of query words was done on a per-index file basis
> instead of just at the start of search.c. This is because different index
> files could have different settings used to define "swish" words.
You are right.
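To make the per-index idea concrete, here is a small sketch. Everything in it is assumed for illustration: struct index_settings stands in for whatever would be read from an index file header, it only models the word characters setting, and next_query_word is a made-up name. The point is only that the same query text is split into words once per index, using that index's own settings.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: settings as they might be read from one index header. */
struct index_settings {
    const char *wordchars;   /* characters allowed inside a "swish" word */
};

/* Copies the next word of the query (lowercased) into out, using the
 * word characters of the given index. Advances *cursor past the word.
 * Returns 1 if a word was found, 0 at the end of the query.
 * Words longer than outsz - 1 characters are truncated. */
static int next_query_word(const struct index_settings *s,
                           const char **cursor, char *out, size_t outsz)
{
    const char *p = *cursor;
    size_t n = 0;

    /* Skip everything that is not a word character for this index. */
    while (*p && !strchr(s->wordchars, tolower((unsigned char)*p)))
        p++;

    /* Collect word characters until the word ends. */
    while (*p && strchr(s->wordchars, tolower((unsigned char)*p))) {
        if (n + 1 < outsz)
            out[n++] = (char)tolower((unsigned char)*p);
        p++;
    }
    if (outsz > 0)
        out[n] = '\0';
    *cursor = p;
    return n > 0;
}
```

With an index whose word characters are only the letters a-z, the query text "5,000 Recipes" yields the single word "recipes"; with an index that also allows digits and the comma, the first word is "5,000". That difference is why the parsing cannot be done just once at the start of the search.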
> Frankly, the entire search.c parsing always has bugged me. It's full of
> hacks now that look for wild cards, or make exceptions if a meta tag name
> is found. Seems like the query needs to be somehow parsed into a better
> syntax tree, but that's way above my skills.
You are right once again. At this point (not released yet) I have
expandstar and getmatchword and add the "wild card" functionality to
because it is faster and more clear. Now, when you search for "a*" you
find the word "and"!!
> But I need some ideas on how to solve this problem:
> Say I have three meta fields: "title", "description", and "subject".
> I concatenate the three into one field "keywords". This means I can use
> swish to search any single field, or, by using "keywords" I can search all
> fields at once (as in my gdb example above). But that has the problem that
> a phrase can span meta fields when searching "keywords".
> One ugly solution would be for me to add some non-word when concatenating
> into one field so phrase would never span fields.
> Or, I could change my queries to look like this:
> -w 'title=(database of over) or description=(database of over)
> or subject=(database of over)'
> But that ends up being three searches and a bit slower especially if the
> query is complex (e.g. with wild cards).
The next version will boost performance on wild cards.
> I wonder how hard it would be to expand the query syntax so I could say:
> -w 'title,description,subject=(database of over)'
> so swish would only have to read the index one time, yet check for the
> words or phrase within each meta tagged field.
> Any ideas?
A lot of work to do on the parser!!
I think the parser has to be partially rewritten.
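As a starting point for discussion, the proposed 'title,description,subject=(...)' syntax could at least be accepted by rewriting it into the existing "or" form before the normal parsing. The sketch below does only that. The syntax is Bill's proposal, not something swish-e supports, the function name expand_fields is made up, and this rewrite still results in one search per field, so it does not give the single pass over the index that he is asking for.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical rewrite of "f1,f2,f3=(expr)" into
 * "f1=(expr) or f2=(expr) or f3=(expr)".
 * Returns 0 on success, -1 on a malformed query or a too-small buffer. */
static int expand_fields(const char *query, char *out, size_t outsz)
{
    const char *eq = strchr(query, '=');   /* start of "=(expr)" */
    const char *p = query;
    size_t used = 0;
    int first = 1;

    if (eq == NULL || outsz == 0)
        return -1;
    out[0] = '\0';

    while (p < eq) {
        size_t len = strcspn(p, ",=");     /* field name ends at ',' or '=' */

        if (len > 0) {
            /* Append "<field>=(expr)", joined with " or ". */
            int n = snprintf(out + used, outsz - used, "%s%.*s%s",
                             first ? "" : " or ", (int)len, p, eq);
            if (n < 0 || (size_t)n >= outsz - used)
                return -1;                 /* output would be truncated */
            used += (size_t)n;
            first = 0;
        }
        p += len;
        if (*p == ',')
            p++;                           /* step over the separator */
    }
    return first ? -1 : 0;                 /* -1 if no field name found */
}
```

For 'title,description,subject=(database of over)' this produces the three-way "or" query quoted above. Reading the index only once would need the field list to be carried into the search itself, which is part of the parser rewrite.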
I will be out for three days and cannot attend to this list. Sorry for
Have a nice day
Jose Manuel Ruiz Ramos
Jefe de Area Informatica
Boletin Oficial del Estado
Received on Tue Apr 25 12:11:48 2000