On Thu, Sep 04, 2003 at 02:14:51AM -0700, William Bailey wrote:
> "For the FAQ we just need some general search score info rather than anything
> Now apart from saying "The most relevant should have a higher score." i don't
> exactly know what to say.
> Now the data that is being searched is both large and has a lot of meta
> fields defined so how will this affect the score? If required i can post
> sample data as well as config files.
I suspect this is in the list archives some place.
No analysis is done to determine what are the "keywords" in a document
-- it's just word frequency that sets the rank.
Look at rank.c. The rank is a rather simplistic calculation. Rank is
done for each word in the query. For HTML there's a point value for
where a word is, e.g. in <title> or <b>, etc. (non-HTML docs all have the
same point value) and there is also a point value based on what meta
name is used (rank bias, which defaults to zero) and the total rank is
just the sum of the point values. The log() of this value is taken to
try to limit the effect of very large documents.
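As a rough sketch of that calculation (this is not the code in rank.c; the point values, the `STRUCTURE_POINTS` table, and the choice to add the meta name bias once per hit are all assumptions made for illustration):

```python
import math

# Hypothetical point values per structure; the real ones live in rank.c.
STRUCTURE_POINTS = {"title": 5, "b": 2, "body": 1}


def word_rank(hits, meta_bias=0):
    """Rank one query word in one document.

    hits      -- list of structure names, one entry per occurrence of the word
    meta_bias -- assumed per-hit rank bias of the meta name (defaults to zero)
    """
    total = sum(STRUCTURE_POINTS.get(where, 1) + meta_bias for where in hits)
    # log() damps the advantage of documents with a huge number of hits.
    return math.log(total) if total > 0 else 0.0
```

With these made-up values, a document with 1000 body hits scores about 6.9 while one with 10 body hits scores about 2.3, so the log narrows the gap but the large document still comes out ahead.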
That seems to work reasonably well for smallish sets of documents that
are similar in size (such as a collection of web pages).
I spent one day searching a somewhat larger collection (60,000 docs) and
comparing it with the same search on Google. Google's page rank was
clearly having a strong impact on search results because documentation
and tutorial type pages were often listed first.
Anyway, swish tended to return very large documents first just due to
the number of search term hits. Swish had indexed smaller web pages but
also mailing list archives where a single mbox file could be a few
I tried a number of changes to rank.c and didn't really see much change
until I then just limited the word frequency to 100 (a totally arbitrary
number) and then those huge documents didn't affect results so much.
Results were a lot closer to Google's.
I didn't expect such a simplistic method to stay in the code, but
testing with a few other collections of documents I had similar results.
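The frequency limit amounts to something like the following sketch (the function name and the flat `points_per_hit` are hypothetical; only the cap of 100 comes from the experiment described above):

```python
import math

MAX_WORD_FREQ = 100  # the arbitrary cap mentioned above


def capped_rank(freq, points_per_hit=1, cap=MAX_WORD_FREQ):
    """Rank from a raw word frequency, ignoring hits beyond the cap."""
    effective = min(freq, cap)
    total = effective * points_per_hit
    return math.log(total) if total > 0 else 0.0
```

A multi-thousand-hit mbox file then ranks no higher on frequency alone than a page with 100 hits.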
[Jose, with that system we could instead limit word frequency on
indexing and reduce index size, perhaps.]
I've often thought word position would also be a good thing to add into
the calculation. If trying to find a document about something (instead
of a document that contains something) the terms found early in the
document should be worth more.
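Nothing like this exists in swish-e as far as this message says, but one way to picture position weighting is a linear decay (the 1.0 to 0.5 range and both function names are purely illustrative assumptions):

```python
def position_weight(position, doc_length):
    """Weight of a hit: 1.0 at the first word, falling to 0.5 at the last."""
    if doc_length <= 1:
        return 1.0
    return 1.0 - 0.5 * (position / (doc_length - 1))


def positional_score(positions, doc_length):
    """Sum of position weights for every occurrence of a word."""
    return sum(position_weight(p, doc_length) for p in positions)
```

Two hits near the top of a document would then outscore two hits near the end.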
AND and OR results just combine ranks. AND does a running average whereas
OR, IIRC, sums them up so that foo OR bar should rank docs with both
higher (but again, depends on term frequency).
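A minimal sketch of the two combinations, assuming each side of the operator has already produced a dict of file number to rank (the dict representation and function names are assumptions, and the OR behaviour is from memory as noted above):

```python
def combine_or(ranks_a, ranks_b):
    """OR: keep every document, summing ranks where both sides matched."""
    combined = dict(ranks_a)
    for doc, rank in ranks_b.items():
        combined[doc] = combined.get(doc, 0) + rank
    return combined


def combine_and(ranks_a, ranks_b):
    """AND: keep only documents on both sides, averaging the two ranks.

    Applied term after term, this pairwise average acts as a running average.
    """
    common = ranks_a.keys() & ranks_b.keys()
    return {doc: (ranks_a[doc] + ranks_b[doc]) / 2 for doc in common}
```

With foo matching files 1 and 2 and bar matching files 2 and 3, OR returns all three files with file 2 carrying the summed rank, while AND returns only file 2.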
AND searches really should also adjust rank based on how close words are
together, as people often search for a phrase without using quotes.
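A proximity adjustment could look something like this sketch, which is only an illustration of the idea and not anything swish-e does (the 1/distance boost and both function names are assumptions):

```python
def min_distance(positions_a, positions_b):
    """Smallest gap between any occurrence of word A and any of word B."""
    return min(abs(a - b) for a in positions_a for b in positions_b)


def proximity_boost(positions_a, positions_b):
    """1.0 when the words are adjacent, shrinking as they drift apart."""
    # max() guards against a zero distance from malformed position data.
    return 1.0 / max(1, min_distance(positions_a, positions_b))
```

The boost would multiply the combined AND rank, so an unquoted search for iron man favours documents where the two words sit next to each other.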
I've asked on this list and other places for help with rank
redesign. Would be a great project for a graduate student.
> I know the use is probably not what swish was designed for but it does the
> job well although the only feature I'm missing is to search for a range of
> values. I know it can be done with the -L but that only applies to properties
> and therefore 1 value per file which is not enough for my requirements as i
> would like to order the results by a field that could occur more then once
> i.e. release dates. Anyway before i get even more off topic :)
Searching for a range of values is really more a database function. I
don't really like the -L feature as it's not very scalable. It just
takes all the documents sorted by that property, inverts that table so
swish-e can lookup by file number and then consults that table to filter
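In outline, the inverted-table lookup described above works roughly like this sketch (not the actual swish-e code; the list layout, the function name, and the use of a binary search to find the range bounds are assumptions):

```python
import bisect


def limit_results(result_files, sorted_files, sorted_values, low, high):
    """Keep only results whose property value lies in [low, high].

    sorted_files  -- every file number in the index, ordered by the property
    sorted_values -- the property values in that same (ascending) order
    """
    # Positions in the sorted table that bound the requested range.
    lo = bisect.bisect_left(sorted_values, low)
    hi = bisect.bisect_right(sorted_values, high) - 1
    # Invert the table: file number -> position in the sort order.
    position = {filenum: i for i, filenum in enumerate(sorted_files)}
    return [f for f in result_files if lo <= position[f] <= hi]
```

The inverted table covers every document in the index no matter how few results the search returns, which is where the scalability concern comes from.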
> For reference here is a typical query along with swish output...
> User searches for:
> * artist: "Black Sabbath"
> * include compilation recordings in artist search.
> * track: iron man
> * format: CD
> * order: Search relevance (highest » lowest)
Looks more like a database select than a full-text search.
> The following command get run:
> /usr/local/bin/swish-e -H 9 -d\\t -w '( ( recording.artist.main=( black
> sabbath ) OR recording.track.artist.main=( black sabbath ) OR
> recording.artist.main.md5=(b1dd10efa6a2761536d12edc20edeca9) OR
> recording.track.artist.main.md5=(b1dd10efa6a2761536d12edc20edeca9) ) AND
> recording.track.title=(iron man) AND recording.media.available.group=( -cd-
> ) AND recording.available=( yes ) AND recording.chanel=(musicmaster) )'
> -s swishrank desc recording.title asc recording.artist.main asc -b 0 -m 3000
> -f /usr/home/wb/Web/Work/red-phase3/_server/data/swish/data.index
Each AND (including the default AND operator) and OR operation is a new
search. So reducing the boolean searches would be good for speed.
Are you using the md5 keys for exact matches? We have talked about
setting flags on the first and last words indexed in a metaname so you
could do a phrase search for "Black Sabbath" where "Black" was the first
word indexed and "Sabbath" the last, i.e. the metaname is exactly "Black
Received on Thu Sep 4 15:17:14 2003