Re: Alpha version Phrase Search

From: Jose Manuel Ruiz <jmruiz(at)not-real.boe.es>
Date: Fri Apr 14 2000 - 16:44:15 GMT
Bill,

I will be out of town this weekend, so I cannot work on swish-e
for a couple of days...

Anyway, thank you for all your comments; you cannot imagine how
useful they are. 


> Ok, I bought a second Linux machine with more power (what a difference!) and have begun testing the Phrase version.
> 
> So far so good.  It seems really fast.  I've indexed about 6000 small files -- all with meta data only.
> 
> It seems to be indexing faster than the old version, too, and the index is not that much larger.  I'm going from 7.8M non-phrase (indexing in 7 minutes) to 8.5M (indexing in 4 minutes) on my index file.  Really nice.
> 
> Searching also seems faster, but I haven't measured it.  And for small searches (returning < 100 results) I doubt I can measure it as this machine is so fast ;)
> 
> The phrase searching is working well, so far.

At last, good news!!
 
> 
> 1) Why are there two settings for:
> 
> #define PHRASE_DELIMITER_CHAR '"'
> #define PHRASE_DELIMITER_STRING "\""
> 
> I also think the default should be a double quote, and not worry about it being a shell meta-character.  People shouldn't be passing queries through the shell, or at least not protected by a single quote.
>

Well, that is the way I like to program. I really like #define
clauses because they make testing easier. And if you want a
different delimiter char, just change it.
Anyway, you are right once again. " is the best delimiter, so I
will probably switch to it in the next package.

> Also, could the PHRASE_DELIMITER be selected by a switch passed at run time?
> 

Yes. It only needs a little extra code...
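
To make the idea concrete, here is a minimal sketch of how a run-time switch could override the compile-time default. It is only an illustration, not the actual swish-e code: the "-P" flag name and both helper functions are assumptions made for this example.

```c
#include <assert.h>
#include <string.h>

/* Compile-time default, as in the #define quoted above. */
#define PHRASE_DELIMITER_CHAR '"'

/* Hypothetical run-time setting; it starts as the compile-time default. */
static char phrase_delimiter = PHRASE_DELIMITER_CHAR;

/* Hypothetical helper: scan argv for "-P <char>" and override the default.
 * Returns 0 on success, -1 if the option value is missing or is not
 * exactly one character long. */
int set_phrase_delimiter_from_args(int argc, char **argv)
{
    int i;

    assert(argv != NULL);
    for (i = 1; i < argc; i++) {
        if (strcmp(argv[i], "-P") == 0) {
            if (i + 1 >= argc || strlen(argv[i + 1]) != 1)
                return -1;
            phrase_delimiter = argv[i + 1][0];
            i++; /* skip the option value */
        }
    }
    return 0;
}

/* Hypothetical accessor used by the query parser in this sketch. */
char get_phrase_delimiter(void)
{
    return phrase_delimiter;
}
```

With this approach the #define stays as the fallback, so existing builds behave the same when the switch is not given.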

> 2) I'm not sure I understand when the word position is bumped.  Say I have a sentence that contains "....searchable database of over.....".  The word "of" is a stop/common word and is not indexed.
> 
> So searching for "database of over" fails, which I think should not fail.
>

Yes, it should not fail. Have you enabled IGNORE_STOPWORDS_IN_QUERY? You
need to set it, and if it still does not work you have found another bug
once again!! Let me take a look at it...

> 3) This is an old point with me, but what do you think?  Say a document contains this:  "....photographs, drawings, maps, charts.....".  Should
> 
>      swish-e -w '"photographs, drawings"'
> 
> find that phrase? I think it should, but of course swish is looking for the word "photographs," with a comma, and in my WordCharacters setting, a comma is not part of a word.  I pre-process my input to split up words like swish does so that search works on my CGI front end.  But I still think swish should work that way automatically.
> 

This also applies to 2).
Let's say "photographs" is word n. Should "drawings" be word n+1 or n+2?
What do you think?
If "drawings" is n+1, you will find all of "photographs, drawings",
"photographs drawings" and "photographs and drawings" (remember that
"and" is a stopword).
If "drawings" is n+2, you will find nothing unless the parser detects the
"," character and looks for "photographs precd(2) drawings" (read this as
"photographs precedes the word drawings by two positions"). This is not
implemented, but it could be.
So, should I bump the word counter on stopwords and/or commas?
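
The two choices can be compared with a small sketch. This is not the swish-e indexer; the stopword list, the struct and the function name are all assumptions made for the example, and words are simplified to runs of ASCII letters.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define MAX_WORDS 32
#define MAX_WORD_LEN 32

struct indexed_word {
    char text[MAX_WORD_LEN];
    int position;
};

/* Hypothetical stopword list, for this example only. */
static const char *stopwords[] = { "and", "of", "the" };

static int is_stopword(const char *w)
{
    size_t i;
    for (i = 0; i < sizeof stopwords / sizeof stopwords[0]; i++)
        if (strcmp(w, stopwords[i]) == 0)
            return 1;
    return 0;
}

/* Split text into lowercase alphabetic words and give each stored word a
 * position. Stopwords are never stored. If bump_on_stopword is nonzero they
 * still consume a position; if bump_on_comma is nonzero each comma consumes
 * one too. Returns the number of words stored in out. */
int assign_positions(const char *text, int bump_on_stopword, int bump_on_comma,
                     struct indexed_word *out, int max_out)
{
    int count = 0, pos = 0;
    const char *p = text;

    assert(text != NULL && out != NULL && max_out >= 0);
    while (*p != '\0') {
        if (isalpha((unsigned char)*p)) {
            char word[MAX_WORD_LEN];
            size_t len = 0;
            while (isalpha((unsigned char)*p)) {
                if (len < MAX_WORD_LEN - 1) /* overly long words are truncated */
                    word[len++] = (char)tolower((unsigned char)*p);
                p++;
            }
            word[len] = '\0';
            if (is_stopword(word)) {
                if (bump_on_stopword)
                    pos++;
            } else if (count < max_out) {
                strcpy(out[count].text, word);
                out[count].position = pos++;
                count++;
            }
        } else {
            if (*p == ',' && bump_on_comma)
                pos++;
            p++;
        }
    }
    return count;
}
```

With both flags off, "photographs, drawings" and "photographs and drawings" both put "drawings" at n+1, so a plain phrase match finds them. With bump_on_comma or bump_on_stopword set, "drawings" lands at n+2 and the query side would need something like precd(2) to match.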

> 4) Are you bumping word count on punctuation such as ending periods?  It seems so as I can't seem to search across sentence boundaries.  In retrospect, I think that word position shouldn't be bumped in those cases -- or at least defined in swish.h.
> 
> For example, text "....food; photos...."
> 
> I can search for (not phrased): food photos
> but searching for phrase: "food; photos" or "food photos" fails.
> 
> Do other people see it that way?
> 

Yes, the word count is increased. Perhaps this may also be an option. 

> 3) I doubt many people are doing this kind of search.  The source records I'm using are made up of fields (e.g. Title, Subject, Description).  This way one can limit the search to a given area of the record.  I also have a "Keywords" meta field that is made up of all the words in Title, Subject, and Description together.  So searching in the Keywords field will find words anywhere.
> 
> The problem with phrases and this type of setup is that phrase search in Keywords can find an ending word in, say, the Subject, and a Starting word in Description.  To solve this problem I'd need a special word or symbol that I could place in my Keywords meta field that would force a bump in the word position.  Something like the period I discussed in 4) above, but something that I could be sure wouldn't appear in normal text that might bump the word count.
> 

Now, each metaName has its own counter. So phrases can also be searched
within a metaName.
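
As a rough illustration of what a per-metaName counter means for phrase matching, here is a small sketch. The struct, the numeric metaname ids and the function names are assumptions for this example and are not taken from the swish-e source.

```c
#include <assert.h>
#include <string.h>

#define MAX_METANAMES 8 /* arbitrary limit for this sketch */

/* One independent word-position counter per metaname id. */
struct position_counters {
    int next[MAX_METANAMES];
};

void reset_counters(struct position_counters *c)
{
    assert(c != NULL);
    memset(c, 0, sizeof *c);
}

/* Return the position for the next word in the given metaname and advance
 * only that metaname's counter. */
int next_position(struct position_counters *c, int meta_id)
{
    assert(c != NULL && meta_id >= 0 && meta_id < MAX_METANAMES);
    return c->next[meta_id]++;
}

/* Two words can be neighbours in a phrase only if they belong to the same
 * metaname and their positions are consecutive. */
int can_be_adjacent(int meta_a, int pos_a, int meta_b, int pos_b)
{
    return meta_a == meta_b && pos_b == pos_a + 1;
}
```

Under these assumptions, the last word of one field and the first word of another can never form a phrase, because the adjacency test also compares the metaname.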

> I guess it would be nice if one could specify more than one meta field to search to swish:
> 
>       swish-e -w 'title,subject,description=(food or wine)'
> 
> Then I wouldn't need a special, combined field.
> 

Perhaps in future releases. For now you can search for:
swish-e -w 'title=(food or wine) or subject=(food or wine) or description=(food or wine)'
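
Until something like that exists, a front end could build the long form itself. The following is a minimal sketch of such a helper; the function name and its interface are assumptions for this example, and it does no validation of the metanames or the expression.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical front-end helper: turn a comma-separated list of metanames
 * plus one search expression into "name1=expr or name2=expr or ...".
 * Empty names (e.g. from ",,") are skipped.
 * Returns 0 on success, -1 if the output buffer is too small. */
int expand_metanames(const char *names, const char *expr,
                     char *out, size_t out_size)
{
    size_t len = 0;
    const char *p = names;

    assert(names != NULL && expr != NULL && out != NULL && out_size > 0);
    out[0] = '\0';

    while (*p != '\0') {
        size_t n = strcspn(p, ","); /* length of the next metaname */
        if (n > 0) {
            int w = snprintf(out + len, out_size - len, "%s%.*s=%s",
                             len > 0 ? " or " : "", (int)n, p, expr);
            if (w < 0 || (size_t)w >= out_size - len)
                return -1; /* output would have been truncated */
            len += (size_t)w;
        }
        p += n;
        if (*p == ',')
            p++;
    }
    return 0;
}
```

For example, calling it with "title,subject,description" and "(food or wine)" produces the query string shown above, ready to pass to -w.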

> 4) I didn't look at the source, but I'm wondering about some program logic.
> You have this in README-PHRASE:
> 
> - Changed delimiter char from _ to \ by default in PHRASE_DELIMITER_CHAR
> and PHRASE_DELIMITER_STRING. Now you can have metanames of the form "key_1".
> 
> That made me wonder why you are looking for the phrase character while parsing metanames.  I had patched a bug before where metanames were being stemmed, and so I'm wondering if you are processing metanames like search terms.
> 

That is correct. I look for the phrase character and add an operator
between words:
"John Smith" is transformed to "John precd Smith" (read this as
"John precedes Smith"). So if there is a metaname between the phrase
delimiters, the metaname will be treated like a search term.
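
A simplified sketch of that rewriting step is below. It is not the real parser: the function names are assumptions, words inside a phrase are split on single spaces only, and the delimiters are simply dropped from the output.

```c
#include <assert.h>
#include <string.h>

/* Append n bytes of s to out; returns -1 if they do not fit. */
static int append(char *out, size_t out_size, size_t *len,
                  const char *s, size_t n)
{
    if (*len + n + 1 > out_size)
        return -1;
    memcpy(out + *len, s, n);
    *len += n;
    out[*len] = '\0';
    return 0;
}

/* Rewrite every delimited phrase in query by putting " precd " between its
 * words. Text outside phrases is copied unchanged.
 * Returns 0 on success, -1 if out is too small or a phrase is not closed. */
int expand_phrases(const char *query, char delim, char *out, size_t out_size)
{
    size_t len = 0;
    int in_phrase = 0;       /* are we between two delimiters? */
    int words_in_phrase = 0; /* words already emitted for this phrase */
    const char *p = query;

    assert(query != NULL && out != NULL && out_size > 0);
    out[0] = '\0';

    while (*p != '\0') {
        if (*p == delim) {
            in_phrase = !in_phrase;
            words_in_phrase = 0;
            p++;
        } else if (in_phrase && *p == ' ') {
            p++; /* spaces inside a phrase are replaced by the operator */
        } else if (in_phrase) {
            const char *start = p;
            while (*p != '\0' && *p != ' ' && *p != delim)
                p++;
            if (words_in_phrase > 0 &&
                append(out, out_size, &len, " precd ", 7) != 0)
                return -1;
            if (append(out, out_size, &len, start, (size_t)(p - start)) != 0)
                return -1;
            words_in_phrase++;
        } else {
            if (append(out, out_size, &len, p, 1) != 0)
                return -1;
            p++;
        }
    }
    return in_phrase ? -1 : 0;
}
```

The sketch also shows why _ was a poor delimiter: with delim set to '_', a metaname such as key_1 opens a phrase that is never closed, so the function reports an error.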
 
> Finally, I've been looking at the code required to highlight phrases in search results.  Highlighting is going to be really tough with phrase searches due to stemming and stop words.  It might be easier to continue highlighting word-by-word....
>

For now, swish-e does not show the contents of the documents, but it could do
that in the future.
For example:

swish-e -d -w search -f file.index -n id -b "<strong>" -e "</strong>"

Possible new flags:
-d means get the document
-n Document number
-b begin tag for highlight 
-e end tag for highlight

Any ideas?

Jose Manuel Ruiz Ramos

jmruiz@boe.es
Received on Fri Apr 14 12:46:19 2000