On Wed, 16 Oct 2002, Bill Moseley wrote:
> At 06:26 PM 10/16/02 -0700, Andrew Smith wrote:
> >username=andrew and year=2002 and DNA
> >In these cases, there is no highlighting and the entire contents of the
> >file is shown (i.e., there is no cutoff to 500 characters).
> It should trim a property if not highlighted. For that matter it should
> just highlight words based on metaname.
> Without looking, it may be that the query parsing code is failing. One
> thing you might try is using a well formed query -- so instead of
> username=andrew and year=2002 and DNA
> username=(andrew) and year=(2002) and swishdefault=(DNA)
> A regular expression has to be used to break the query into groups by
> metaname (so the highlighting code knows which words go with each metaname).
> The script uses the "Parsed Words" header output from swish. So you might
> print that to stderr. Also, there's some debugging at the end of
> extract_query_match() that you can uncomment to see how it's parsing out
> your query. That also might give you an idea of what's going wrong.
> I've been saying for months that that script could use a rewrite.
I've been looking into this some more, and the problem is that
extract_query_match is not working correctly. It assumes all metanames
are given explicitly, but Swish-e doesn't require that: if no metaname is
attached to a search term, the term is assumed to be for swishdefault
(e.g. "DNA sequence" == "swishdefault=(DNA and sequence)"). Thus, queries
where only search terms are given, or where bare terms are interspersed
with metanames, are not parsed correctly. For example:
"DNA sequence" --> nothing extracted
"username = andrew AND DNA AND date = 021210" --> "DNA" should be attached
to swishdefault, but
I went ahead and wrote a new version of extract_query_match which handles
cases like the above. Like the original, it looks for "metaname = ..."
chunks, but any search terms left outside such a chunk are attached to
swishdefault. I wrote it as a simple recursive-descent parser. It solves
the above cases and seems fine on other cases too, although I haven't
tested it extensively. If people think it useful, I'd like to donate it to
the Swish-e community so it could be made available from the Swish-e
website. How should I go about doing this?
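To make the idea concrete, here is a minimal sketch of that kind of parser. It is written in Python for brevity and is not the Perl code described above. The grammar, the token regex, and the rule that "metaname =" applies only to the next word or parenthesized group are assumptions made for illustration.

```python
import re

# Tokens: parentheses, "=", a quoted phrase, or a bare word (assumed syntax).
TOKEN_RE = re.compile(r'\(|\)|=|"[^"]*"|[^\s()="]+')
OPERATORS = {"and", "or", "not"}


def extract_query_match(query, default_meta="swishdefault"):
    """Group the search words of `query` by metaname.

    Assumed grammar:
        expr    := (OPERATOR | term)*
        term    := [metaname "="] primary
        primary := "(" expr ")" | word | phrase
    Words with no explicit metaname are attached to `default_meta`.
    """
    tokens = TOKEN_RE.findall(query)
    pos = 0
    words = {}

    def peek(offset=0):
        i = pos + offset
        return tokens[i] if i < len(tokens) else None

    def parse_expr(meta):
        nonlocal pos
        while peek() is not None and peek() != ")":
            if peek().lower() in OPERATORS:
                pos += 1  # operators carry no words to highlight
            else:
                parse_term(meta)

    def parse_term(meta):
        nonlocal pos
        # "name =" switches the metaname for the next primary only.
        if peek(1) == "=" and peek() not in ("(", ")", "="):
            meta = peek()
            pos += 2
        parse_primary(meta)

    def parse_primary(meta):
        nonlocal pos
        tok = peek()
        if tok is None or tok == ")":
            return  # nothing to attach; the caller handles ")"
        pos += 1
        if tok == "(":
            parse_expr(meta)
            if peek() == ")":
                pos += 1
        elif tok != "=":  # a stray "=" is skipped
            words.setdefault(meta, []).extend(tok.strip('"').split())

    while pos < len(tokens):
        parse_expr(default_meta)
        if peek() == ")":
            pos += 1  # skip an unbalanced ")" and keep going
    return words
```

With this sketch, the two failing queries above come out as
{'swishdefault': ['DNA', 'sequence']} and
{'username': ['andrew'], 'swishdefault': ['DNA'], 'date': ['021210']}.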
Also, I wanted to build my own CGI script (with my own forms, navigation
features, result format, etc.) for searching with Swish-e, while still
using the highlighting modules that come with the Swish-e distribution.
When I studied those modules, they seemed tightly integrated with the
data structures inside swish.cgi, so using them independently of
swish.cgi looked difficult. In particular, I wanted to use the
PhraseHighlight.pm module, so I created a version of it with a simpler
interface that can be used independently of swish.cgi.
Basically, in my modified version you create the highlight object by
passing in a hash of the result header names and values (i.e., "Parsed
Words" => ..., etc.); and then you highlight text by calling "highlight",
passing in a reference to the text to highlight and the metaname whose
terms should be highlighted in that text. The code which actually does
the highlighting is the same, however. I didn't create similar versions
of SimpleHighlight.pm and DefaultHighlight.pm, but they could easily be
modified similarly. Anyway, I'd like to donate this code also if it would
be useful to the Swish-e community.
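Roughly, the interface has the following shape. This is a Python sketch rather than the modified PhraseHighlight.pm itself: the class name Highlighter, the helper words_by_meta, and the <b> markers are hypothetical, and the matching here is plain whole-word matching, not the phrase highlighting the real module does. Since Python strings are immutable, highlight returns a new string instead of taking a reference.

```python
import re


def words_by_meta(parsed_words, default_meta="swishdefault"):
    """Hypothetical helper: group the words of a "Parsed Words" value by metaname."""
    tokens = re.findall(r'[()=]|[^\s()="]+', parsed_words)
    groups = {}
    stack = [default_meta]  # metaname in effect at each paren depth
    pending = None          # set by "name =" for the next word or group
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if tok == "=":
            continue
        if nxt == "=" and tok not in ("(", ")"):
            pending = tok
        elif tok == "(":
            stack.append(pending or stack[-1])
            pending = None
        elif tok == ")":
            if len(stack) > 1:
                stack.pop()
        elif tok.lower() not in ("and", "or", "not"):
            groups.setdefault(pending or stack[-1], []).append(tok)
            pending = None
    return groups


class Highlighter:
    """Standalone highlighter built from the result headers (illustrative only)."""

    def __init__(self, headers, on="<b>", off="</b>"):
        # Only the "Parsed Words" header is used in this sketch.
        self.groups = words_by_meta(headers.get("Parsed Words", ""))
        self.on, self.off = on, off

    def highlight(self, text, metaname):
        words = self.groups.get(metaname)
        if not words:
            return text  # nothing to highlight for this metaname
        # Longest words first so the alternation prefers longer matches.
        alternatives = sorted(set(words), key=len, reverse=True)
        pattern = re.compile(
            r"\b(" + "|".join(re.escape(w) for w in alternatives) + r")\b",
            re.IGNORECASE,
        )
        return pattern.sub(lambda m: self.on + m.group(0) + self.off, text)
```

Usage would look like h = Highlighter({"Parsed Words": "username = andrew and DNA"}) followed by h.highlight(description, "swishdefault"), so the calling script only needs the headers and the property text, not the data structures inside swish.cgi.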
Received on Thu Jan 30 00:14:38 2003