
Re: Combining stem/non stem removing dups in perl

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Nov 04 2004 - 18:13:14 GMT
On Wed, Nov 03, 2004 at 02:59:51PM -0800, Brad Miele wrote:
> along these lines, and since you were the sucker, err kind soul who
> responded first, do you know if there is a way to force a meta for every
> record based on config?
> 
> basically, if i am indexing /xmldocs, once with stem and once without, i
> 
> would like to set a sort value of 0 for the non-stemmed and 1 for the
> stemmed, so when i sorted the results

You want a way to identify which index a given result comes from?  You
don't mean metanames, you mean a property name -- 'swishdbfile' is
the name.
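For example, assuming you query both indexes and attach each hit's
swishdbfile property to a Perl hash (the index names and fields below
are made up for illustration), deriving a 0/1 sort value is easy:

```perl
use strict;
use warnings;

# Hypothetical merged hits from the two indexes; in a real script the
# swishdbfile value would come from the result's property.
my @results = (
    { file => 'doc1.html', swishdbfile => 'index-stem.db',   rank => 800 },
    { file => 'doc2.html', swishdbfile => 'index-nostem.db', rank => 950 },
);

# Map each index file to a sort value: 0 = non-stemmed, 1 = stemmed.
my %sort_value = (
    'index-nostem.db' => 0,
    'index-stem.db'   => 1,
);

# Non-stemmed hits first, then by rank within each group.
my @sorted = sort {
    $sort_value{ $a->{swishdbfile} } <=> $sort_value{ $b->{swishdbfile} }
        or $b->{rank} <=> $a->{rank}
} @results;

print join( "\n", map { $_->{file} } @sorted ), "\n";
```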

Here are some other ideas:

You might set a limit on how many results you will process.  Who is
going to page through 10,000 results?  That alone can limit the
processing requirements.

You can then read in all your results, do your processing, and use
Perl's Storable to dump the result set to disk.  A quick Google found
one discussion on caching:

  http://mathforum.org/epigone/modperl/dwimpblelkox
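A minimal sketch of that approach with core Storable (the cache file
comes from File::Temp, and the result data here is made up):

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempfile);

# Made-up post-processed result set.
my @results = map { { file => "doc$_.html", rank => 1000 - $_ } } 1 .. 3;

my ( $fh, $cache ) = tempfile();
close $fh;

store( \@results, $cache );      # first request: dump the set to disk

my $cached = retrieve($cache);   # later "next page" request: reload it
print scalar(@$cached), " cached results\n";
unlink $cache;
```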

I've done this before with Cache::FileCache (or a similar module that
handles the cache management), and then on subsequent requests (for,
say, the "next page") used the cached data.  It was faster than having
swish run the same query again.

Managing the cache from within swish would likely be faster, but
quite a bit more work.  Still, it would be nice to be able to store
the "result set" that swish maintains internally and then reload it.
That would avoid having to re-read all the properties you are
interested in (important if you have a lot of data in your props).

When I did that in the past I was only using swish to return the file
name and two small properties -- no description.  So, swish-e returned
all hits (well, up to some pre-defined limit of a few thousand), then
I would do a bit of processing in perl, display the first result set,
fork and close stdout (to allow the web server to finish the
connection), and then write the result set to disk.
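That fork-and-detach trick looks roughly like this (a sketch, not the
original code; the waitpid is only there to make the demo
deterministic -- a CGI script would not wait for the child):

```perl
use strict;
use warnings;
use Storable qw(store retrieve);

my @results    = ( { file => 'a.html', rank => 900 } );
my $cache_file = "/tmp/swish_cache.$$";    # hypothetical cache path

# ... first page of results has already been printed here ...

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ( $pid == 0 ) {
    # Child: close STDOUT so the web server can finish the connection,
    # then write the full result set to disk for the next request.
    close STDOUT;
    store( \@results, $cache_file );
    exit 0;
}
waitpid( $pid, 0 );    # demo only; see above
```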

Inside swish you can probably do quite a bit.  First, the index file
is known at search time, so you could without much trouble alter the
ranking based on the index file.  Tuning that would be tricky, though,
because the stemmed indexes will likely have more word hits.  Still,
it's something you could do with a bit of hacking.

Also, if you are careful when indexing (that is, create an output
file with -S prog and then create both your indexes from that file),
the file numbers should match up.  So, after sorting you could
allocate an array the size of the result set and weed out the
duplicates based on file number.  That avoids the need to read the
property file for each result's file name, and swish will report an
accurate total number of hits.
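In Perl post-processing, the same weeding is a one-liner with a
seen-hash (sample data assumed; filenum stands in for swish's
internal file number):

```perl
use strict;
use warnings;

# Already sorted so the preferred (non-stemmed) copy of each document
# comes first.
my @sorted = (
    { filenum => 7, stemmed => 0, rank => 900 },
    { filenum => 7, stemmed => 1, rank => 850 },
    { filenum => 3, stemmed => 1, rank => 700 },
);

# Keep only the first hit seen for each file number.
my %seen;
my @unique = grep { !$seen{ $_->{filenum} }++ } @sorted;

print scalar(@unique), " unique hits\n";
```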

Likely not as fast, but another way would be to sort the results by
file number, then walk the results removing the duplicates, then
re-sort.  That could be done either in swish or in post-processing in
your script.
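In a script, that sort/walk/re-sort version might look like this
(sample data assumed):

```perl
use strict;
use warnings;

my @results = (
    { filenum => 3, rank => 700 },
    { filenum => 7, rank => 900 },
    { filenum => 3, rank => 650 },
);

# 1) Sort by file number (rank breaks ties) so duplicates are adjacent
#    and the best-ranked copy of each document comes first.
my @byfile = sort {
    $a->{filenum} <=> $b->{filenum} or $b->{rank} <=> $a->{rank}
} @results;

# 2) Walk the list, dropping any hit whose file number repeats.
my @unique;
for my $r (@byfile) {
    push @unique, $r
        unless @unique && $unique[-1]{filenum} == $r->{filenum};
}

# 3) Re-sort by rank for display.
@unique = sort { $b->{rank} <=> $a->{rank} } @unique;

print scalar(@unique), " hits after dedup\n";
```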



-- 
Bill Moseley
moseley@hank.org

Unsubscribe from or help with the swish-e list: 
   http://swish-e.org/Discussion/

Help with Swish-e:
   http://swish-e.org/current/docs
   swish-e@sunsite.berkeley.edu
Received on Thu Nov 4 10:13:16 2004