Re: Alpha version Phrase Search

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Apr 14 2000 - 14:46:48 GMT

Jose,

Ok, I bought a second Linux machine with more power (what a difference!) and have begun testing the Phrase version.

So far so good.  It seems really fast.  I've indexed about 6000 small files -- all with meta data only.

It seems to be indexing faster than the old version, too, and the index is not that much larger.  I'm going from 7.8M non-phrase (indexing in 7 minutes) to 8.5M (indexing in 4 minutes) on my index file.  Really nice.

Searching also seems faster, but I haven't measured it.  And for small searches (returning < 100 results) I doubt I can measure it as this machine is so fast ;)

The phrase searching is working well, so far.

Great work!

I have some general questions/comments:

1) Why are there two settings for:

#define PHRASE_DELIMITER_CHAR '"'
#define PHRASE_DELIMITER_STRING "\""

I also think the default should be a double quote, without worrying about it being a shell meta-character.  People shouldn't be passing queries through the shell, or at least not without protecting them with single quotes.

Also, could the PHRASE_DELIMITER be selected by a switch passed at run time?  
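
A minimal sketch of how a single run-time value could replace both macros, with the string form derived from the character form. The names `phrase_delim`, `phrase_delim_str`, and `set_phrase_delimiter`, and the list of rejected characters, are hypothetical and not taken from the swish-e source.

```c
/* Hypothetical run-time replacement for the two compile-time macros. */
char phrase_delim = '"';          /* character form */
char phrase_delim_str[2] = "\"";  /* string form, kept in sync with the above */

/* Set both forms from one character, for example the argument of a
   command-line switch.  Returns 0 on success, -1 if the character is
   rejected.  The rejected set is an assumption about the query syntax. */
int set_phrase_delimiter(char c)
{
    if (c == '\0' || c == ' ' || c == '(' || c == ')' || c == '=')
        return -1;
    phrase_delim = c;
    phrase_delim_str[0] = c;
    phrase_delim_str[1] = '\0';
    return 0;
}
```

With something like this, only one value needs to be configured, and the default can stay a double quote.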

2) I'm not sure I understand when the word position is bumped.  Say I have a sentence that contains "....searchable database of over.....".  The word "of" is a stop/common word and is not indexed.

So searching for "database of over" fails, and I think it should not.
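
One possible way to make such a phrase match, sketched under the assumption that the indexer still advances the word position when it skips a stop word: treat a stop word inside the query phrase as a placeholder that matches whatever sat at that position. The `struct posting` layout, the stop list, and all function names here are illustrative and not the actual swish-e data structures.

```c
#include <string.h>

/* One indexed word and its position in the document (illustrative layout). */
struct posting { const char *word; int pos; };

/* Assumed stop list; the real one comes from the indexer configuration. */
static const char *stop_words[] = { "of", "the", "a", NULL };

static int is_stop_word(const char *w)
{
    for (int i = 0; stop_words[i] != NULL; i++)
        if (strcmp(stop_words[i], w) == 0)
            return 1;
    return 0;
}

static int has_word_at(const struct posting *idx, int n, const char *w, int pos)
{
    for (int i = 0; i < n; i++)
        if (idx[i].pos == pos && strcmp(idx[i].word, w) == 0)
            return 1;
    return 0;
}

/* Return 1 if the phrase occurs in the document.  Stop words in the
   phrase are not looked up; they only occupy one position each. */
int phrase_matches(const struct posting *idx, int n,
                   const char **phrase, int len)
{
    int first = 0;
    while (first < len && is_stop_word(phrase[first]))
        first++;
    if (first == len)
        return 0;  /* phrase consists only of stop words */

    for (int i = 0; i < n; i++) {
        if (strcmp(idx[i].word, phrase[first]) != 0)
            continue;
        int start = idx[i].pos - first;  /* position of phrase word 0 */
        int ok = 1;
        for (int j = first + 1; j < len && ok; j++)
            if (!is_stop_word(phrase[j]) &&
                !has_word_at(idx, n, phrase[j], start + j))
                ok = 0;
        if (ok)
            return 1;
    }
    return 0;
}
```

With "searchable", "database", and "over" indexed at positions 1, 2, and 4, the phrase "database of over" matches, while "database over" does not because the two words are not adjacent.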

3) This is an old point with me, but what do you think?  Say a document contains this:  "....photographs, drawings, maps, charts.....".  Should

     swish-e -w '"photographs, drawings"' 

find that phrase? I think it should, but of course swish is looking for the word "photographs," with a comma, and in my WordCharacters setting, a comma is not part of a word.  I pre-process my input to split up words like swish does so that search works on my CGI front end.  But I still think swish should work that way automatically.
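
The pre-processing described above could look roughly like the following sketch, which splits the text of a phrase the same way an indexer would split input. The `word_chars` string is a stand-in for the WordCharacters setting, and `normalize_phrase` is a hypothetical helper, not a swish-e function.

```c
#include <ctype.h>
#include <string.h>

/* Assumed stand-in for the WordCharacters configuration setting. */
static const char *word_chars = "abcdefghijklmnopqrstuvwxyz0123456789";

/* Copy `in` to `out`, lowercasing ASCII letters and replacing each run of
   non-word characters between words with a single space.  Leading and
   trailing non-word characters are dropped.  `out` must have room for
   strlen(in) + 1 bytes. */
void normalize_phrase(const char *in, char *out)
{
    size_t n = 0;
    int pending_space = 0;

    for (; *in != '\0'; in++) {
        char c = (char)tolower((unsigned char)*in);
        if (strchr(word_chars, c) != NULL) {
            if (pending_space && n > 0)
                out[n++] = ' ';
            out[n++] = c;
            pending_space = 0;
        } else {
            pending_space = 1;
        }
    }
    out[n] = '\0';
}
```

Applied to the text between the phrase delimiters, this turns `photographs, drawings` into `photographs drawings` before the words are looked up.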

4) Are you bumping the word position on punctuation such as ending periods?  It seems so, since I can't search across sentence boundaries.  In retrospect, I think that the word position shouldn't be bumped in those cases -- or at least that the behavior should be configurable in swish.h.

For example, text "....food; photos...."

I can search for (not phrased): food photos
but searching for phrase: "food; photos" or "food photos" fails.

Do other people see it that way?


5) I doubt many people are doing this kind of search.  The source records I'm using are made up of fields (e.g. Title, Subject, Description).  This way one can limit the search to a given area of the record.  I also have a "Keywords" meta field that is made up of all the words in Title, Subject, and Description together.  So searching in the Keywords field will find words anywhere.

The problem with phrases and this type of setup is that a phrase search in Keywords can find an ending word in, say, the Subject, and a starting word in the Description.  To solve this problem I'd need a special word or symbol that I could place in my Keywords meta field that would force a bump in the word position.  Something like the period I discussed in 4) above, but something that I could be sure wouldn't appear in normal text and bump the word position by accident.
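
A rough sketch of what such a marker could do at indexing time, assuming word positions are assigned by a simple counter. The marker value (`FIELD_BREAK`, here the ASCII unit separator), the gap size, and `assign_positions` are all hypothetical choices for illustration.

```c
#include <string.h>

#define FIELD_BREAK "\x1f"  /* hypothetical marker: ASCII unit separator */
#define BREAK_GAP   10      /* assumed gap; any value >= 1 breaks adjacency */

/* Fill positions[i] with the word position of tokens[i].  Marker tokens
   are not indexed and get -1; each one pushes the position counter
   forward by BREAK_GAP.  Returns the number of indexed words. */
int assign_positions(const char **tokens, int n, int *positions)
{
    int pos = 0;
    int indexed = 0;

    for (int i = 0; i < n; i++) {
        if (strcmp(tokens[i], FIELD_BREAK) == 0) {
            pos += BREAK_GAP;
            positions[i] = -1;
        } else {
            positions[i] = ++pos;
            indexed++;
        }
    }
    return indexed;
}
```

With the marker placed between the Subject and Description words in the combined field, the last word of one and the first word of the other no longer sit at consecutive positions, so a phrase cannot span the boundary.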

I guess it would be nice if one could specify more than one meta field to search to swish:

      swish-e -w 'title,subject,description=(food or wine)'

Then I wouldn't need a special, combined field.
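
Until something like that exists, a front end could expand the combined form itself. The sketch below rewrites `a,b,c=(expr)` into `a=(expr) or b=(expr) or c=(expr)`; it assumes the query parser accepts one `metaname=(...)` group per field joined by `or`, which should be checked against the actual parser. `expand_metanames` is a hypothetical helper.

```c
#include <stdio.h>
#include <string.h>

/* Expand "a,b,c=(expr)" into "a=(expr) or b=(expr) or c=(expr)".
   Returns 0 on success, -1 if the query has no '=' or `out` is too small. */
int expand_metanames(const char *query, char *out, size_t outsz)
{
    const char *eq = strchr(query, '=');
    if (eq == NULL || outsz == 0)
        return -1;

    const char *expr = eq + 1;   /* everything after '=' */
    const char *name = query;    /* start of the current metaname */
    size_t used = 0;
    out[0] = '\0';

    while (name < eq) {
        /* Find the end of this metaname: next comma, or the '='. */
        const char *end = memchr(name, ',', (size_t)(eq - name));
        if (end == NULL)
            end = eq;

        int w = snprintf(out + used, outsz - used, "%s%.*s=%s",
                         used ? " or " : "", (int)(end - name), name, expr);
        if (w < 0 || (size_t)w >= outsz - used)
            return -1;           /* output would not fit */
        used += (size_t)w;

        name = (end < eq) ? end + 1 : eq;
    }
    return 0;
}
```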


6) I didn't look at the source, but I'm wondering about some program logic.
You have this in README-PHRASE:

- Changed delimiter char form _ to \ by default in PHRASE_DELIMITER_CHAR 
and PHRASE_DELIMITER_STRING. Now you can have metanames of the form "key_1".

That made me wonder why you are looking for the phrase character while parsing metanames.  I had patched a bug before where metanames were being stemmed, and so I'm wondering if you are processing metanames like search terms.


Finally, I've been looking at the code required to highlight phrases in search results.  Highlighting is going to be really tough with phrase searches due to stemming and stop words.  It might be easier to continue highlighting word-by-word....



Bill Moseley
mailto:moseley@hank.org
Received on Fri Apr 14 10:50:18 2000