Re: swish.cgi and metaname search in query.

From: Andrew Smith <asmith(at)not-real.compbio.berkeley.edu>
Date: Thu Jan 30 2003 - 00:14:17 GMT

On Wed, 16 Oct 2002, Bill Moseley wrote:

> At 06:26 PM 10/16/02 -0700, Andrew Smith wrote:
> >username=andrew
> >
> >username=andrew and year=2002 and DNA
> >
> >In these cases, there is no highlighting and the entire contents of the 
> >file is shown (i.e., there is no cutoff to 500 characters).
> 
> It should trim a property if not highlighted.  For that matter it should
> just highlight words based on metaname.
> 
> Without looking, it may be that the query parsing code is failing.  One
> thing you might try is using a well formed query -- so instead of 
> 
>     username=andrew and year=2002 and DNA
> 
> use
> 
>     username=(andrew) and year=(2002) and swishdefault=(DNA)
> 
> A regular expression has to be used to break the query into groups by
> metaname (so the highlighting code knows which words go with each metaname).
> 
> The script uses the "Parsed Words" header output from swish.  So you might
> print that to stderr.  Also, there's some debugging at the end of
> extract_query_match() that you can uncomment to see how it's parsing out
> your query.  That also might give you an idea of what's going wrong.
> 
> 
> I've been saying for months that that script could use a rewrite.

I've looked into this some more, and the problem is that 
extract_query_match does not work correctly. It assumes that every 
metaname is given explicitly, but Swish-e doesn't require that: a search 
term with no metaname attached is assumed to belong to swishdefault (e.g., 
"DNA sequence" == "swishdefault=(DNA and sequence)"). As a result, queries 
that consist only of bare search terms, or that mix bare terms with 
metaname searches, are not parsed correctly. For example:

"DNA sequence" --> nothing extracted

"username = andrew AND DNA AND date = 021210" --> "DNA" should be attached 
                                                   to swishdefault, but 
                                                   isn't

I went ahead and wrote a new version of extract_query_match that handles 
cases like those above. It still looks for "metaname = ..." chunks, but 
any search terms left outside such a chunk are attached to swishdefault. 
I wrote it as a simple recursive-descent parser. It solves the cases 
above and seems fine on the other cases I've tried, although I haven't 
tested it extensively. If people think it is useful, I'd like to donate 
it to the Swish-e community so that it could be made available from the 
Swish-e website. How should I go about doing this?
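
To make the approach concrete, here is a minimal sketch of the idea in 
Python. It is not the rewritten Perl extract_query_match itself: the 
function name group_terms_by_metaname, the tokenizer, and the handling of 
operators are assumptions made only for illustration.

```python
import re

# Tokens: parentheses, "=", a double-quoted phrase, or a bare word.
TOKEN = re.compile(r'\(|\)|=|"[^"]*"|[^\s()="]+')
OPERATORS = {"and", "or", "not"}


def group_terms_by_metaname(query, default="swishdefault"):
    """Return {metaname: [terms]}; bare terms fall back to `default`."""
    tokens = TOKEN.findall(query)
    terms = {}
    pos = 0

    def parse_sequence(metaname, nested):
        # sequence := item*   (a nested sequence ends at its ")")
        nonlocal pos
        while pos < len(tokens):
            if tokens[pos] == ")":
                pos += 1
                if nested:
                    return
                continue  # stray ")" at top level is ignored
            parse_item(metaname)

    def parse_item(metaname):
        # item := "(" sequence ")" | NAME "=" item | OPERATOR | TERM
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            parse_sequence(metaname, nested=True)
        elif tok == "=":
            pass  # stray "=" with no metaname in front of it
        elif pos < len(tokens) and tokens[pos] == "=":
            pos += 1  # tok is a metaname; it binds only to the next item
            if pos < len(tokens) and tokens[pos] != ")":
                parse_item(tok)
        elif tok.lower() in OPERATORS:
            pass  # boolean operators carry no highlightable term
        else:
            terms.setdefault(metaname, []).append(tok.strip('"'))

    parse_sequence(default, nested=False)
    return terms


print(group_terms_by_metaname("DNA sequence"))
print(group_terms_by_metaname("username = andrew AND DNA AND date = 021210"))
```

In this sketch a metaname applies only to the single word or parenthesized 
group that follows its "=", and everything else is collected under the 
default metaname, which is how the two failing queries above end up with 
their bare terms attached to swishdefault.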

Also, I wanted to build my own CGI script (with my own forms, navigation 
features, result format, etc.) for searching with Swish-e, while still 
using the highlighting modules that come with the Swish-e distribution. 
When I studied those modules, they seemed tightly integrated with the data 
structures inside swish.cgi, and it looked difficult to use them 
independently of swish.cgi. In particular, I wanted to use the 
PhraseHighlight.pm module, so I created a version of it with a simpler 
interface that can be used on its own. In my modified version, you create 
the highlight object by passing in a hash of the result header names and 
values (i.e., "Parsed Words" => ..., etc.), and then you highlight text by 
calling "highlight", passing in a reference to the text to highlight and 
the metaname whose terms should be highlighted in that text. The code that 
actually does the highlighting is unchanged. I didn't create similar 
versions of SimpleHighlight.pm and DefaultHighlight.pm, but they could 
easily be modified in the same way. I'd like to donate this code as well 
if it would be useful to the Swish-e community.
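
The shape of that interface can be sketched as follows, again in Python 
rather than Perl. Everything here is hypothetical: the class name 
PhraseHighlighter, the flat "name = word" reading of the "Parsed Words" 
header, and the <b> markers are assumptions, and the sketch does plain 
whole-word matching rather than the real phrase-highlighting logic.

```python
import re


class PhraseHighlighter:
    """Hypothetical stand-alone highlighter built from result headers."""

    OPERATORS = {"and", "or", "not"}

    def __init__(self, headers, on="<b>", off="</b>"):
        # `headers` maps result header names to values,
        # e.g. {"Parsed Words": "username = andrew and DNA"}.
        self.on, self.off = on, off
        self.terms = self._group(headers.get("Parsed Words", ""))

    @classmethod
    def _group(cls, parsed_words, default="swishdefault"):
        # Assumed format: "name = word" binds word to name;
        # any other word belongs to the default metaname.
        tokens = re.findall(r'=|[^\s=()"]+', parsed_words)
        terms = {}
        i = 0
        while i < len(tokens):
            if i + 2 < len(tokens) and tokens[i + 1] == "=":
                name, word = tokens[i], tokens[i + 2]
                i += 3
            else:
                name, word = default, tokens[i]
                i += 1
            if word != "=" and word.lower() not in cls.OPERATORS:
                terms.setdefault(name, []).append(word)
        return terms

    def highlight(self, text, metaname="swishdefault"):
        # Python strings are immutable, so this returns a new string
        # instead of modifying the text through a reference.
        words = self.terms.get(metaname)
        if not words:
            return text
        pattern = r"\b(" + "|".join(re.escape(w) for w in words) + r")\b"
        return re.sub(
            pattern,
            lambda m: self.on + m.group(1) + self.off,
            text,
            flags=re.IGNORECASE,
        )


hl = PhraseHighlighter({"Parsed Words": "username = andrew and DNA"})
print(hl.highlight("DNA from Andrew", "swishdefault"))
print(hl.highlight("DNA from Andrew", "username"))
```

The point of the design is that the caller needs nothing from swish.cgi: 
the headers go in once at construction, and each property is highlighted 
against only the terms that were searched under its own metaname.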

-Andrew Smith
Received on Thu Jan 30 00:14:38 2003