

Re: Combining stem/non stem removing dups in perl

From: <brad(at)not-real.auroraquanta.com>
Date: Thu Nov 04 2004 - 18:54:12 GMT
Quick google Mr. December 18??? ;)

What i **really** want to know is if you googled caching or googled
yourself.

Having access to the swishdbfile property will take me a long way
by allowing me to push the stemmed results to the back. then i can do
something like ask swish for the hit counts from each index, and
know that if i am sorting the combined set with stems last, i limit with
the knowledge that my stem results won't even start until result number
3000 or whatever.
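something like this is what i have in mind for the stems-last sort (a
rough sketch on made-up sample data; in practice the file, rank, and
index values would come out of each result's swishdocpath, swishrank,
and swishdbfile properties, and the index names here are hypothetical):

```perl
use strict;
use warnings;

# made-up sample hits, as if pulled from a combined two-index search
my @hits = (
    { file => 'a.html', rank => 900, index => 'stem.index'   },
    { file => 'b.html', rank => 950, index => 'nostem.index' },
    { file => 'c.html', rank => 800, index => 'nostem.index' },
);

# non-stemmed hits sort first, then by rank descending within each group
my @sorted = sort {
    ( $a->{index} eq 'stem.index' ? 1 : 0 )
        <=> ( $b->{index} eq 'stem.index' ? 1 : 0 )
        || $b->{rank} <=> $a->{rank}
} @hits;
```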

i used to store my result sets as blobs in mysql, which didn't scale so
well, and at some point went back to relying on swish-e's internals as
they were so much easier to work with. at about the same time i went from
returning a limited set of props as you mentioned, which i then passed to
mysql for the extended info, to adding my properties to the index.

But i haven't looked at moving to in memory or disk caching.
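if i do go that route, Storable seems like the obvious starting point
(a rough sketch, untested against real swish output; the cache path is
just a temp file here and the result set is made up):

```perl
use strict;
use warnings;
use Storable qw( store retrieve );
use File::Temp qw( tempfile );

# pretend this is the processed result set from the first request
my $results = [
    { file => 'a.html', rank => 900 },
    { file => 'b.html', rank => 800 },
];

my ( $fh, $cache_file ) = tempfile();

store( $results, $cache_file );        # first request: dump the set to disk
my $cached = retrieve( $cache_file );  # "next page" request: reload it
```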

My last question, Bill, and as always thanks for the time: i know that in
the case of html docs, swish assigns weight based on the importance of
elements (title, body, etc). if i use xml elements with the same names,
does it consider them the same? I noticed that the next version has something
called MetaNamesRank, which seems like it will allow me to do this via
config.
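from my read of the 2.4.x docs, MetaNamesRank takes a bias and a list of
metanames, so the config-level version would look something like this
(the metaname names and numbers below are made up):

```
# bias word hits in these metas up or down
MetaNamesRank  5  headline
MetaNamesRank -3  caption
```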

Brad
------------------------------------------------------------
 Brad Miele
 Technology Director
 IPNStock
 (866) 476-7862 x902
 bmiele@ipnstock.com

 Imagination is the one weapon in the war against reality.
		-- Jules de Gaultier


On Thu, 4 Nov 2004, Bill Moseley wrote:

> On Wed, Nov 03, 2004 at 02:59:51PM -0800, Brad Miele wrote:
> > along these lines, and since you were the sucker, err kind soul who
> > responded first, do you know if there is a way to force a meta for every
> > record based on config?
> >
> > basically, if i am indexing /xmldocs, once with stem and once without, i
> > would like to set a sort value of 0 for the non-stemmed and 1 for the
> > stemmed, so when i sorted the results
>
> You want a way to identify which index a given result comes from?  You
> don't mean metanames, you mean a property name -- 'swishdbfile' is
> the name.
>
> Here are some other ideas:
>
> You might set a limit of how many results you will process.  Who is
> going to page through 10,000 results?  So that can limit the
> processing requirements.
>
> You can then read in all your results and do your processing then use
> Perl's Storable to dump the result set to disk.  A quick google found
> one discussion on caching:
>
>   http://mathforum.org/epigone/modperl/dwimpblelkox
>
> I've done this before with Cache::FileCache (or something like that
> that handles the cache management) and then on following requests (for
> like "next page") use the cached data.  It was faster than having
> swish run the same query again.
>
> Managing the cache from within swish would likely be faster, but quite
> a bit more work.  But, it would be nice to be able to store the
> "result set" that swish maintains internally and then reload that.
> That would avoid having to read all the properties that you are
> interested in (important if you have a lot of data in your props).
>
> When I did that in the past I was only using swish to return the file
> name and two small properties -- no description.  So, swish-e returned
> all hits (well, up to some pre-defined limit of a few thousand), then
> I would do a bit of processing in perl, display the first result set,
> fork and close stdout (to allow the web to finish the connection) and
> then I'd write the result set to disk.
>
> Inside swish you can probably do quite a bit.  First, the index file
> is known at search time, so you could without much trouble alter the
> ranking based on the index file.  It would be hard because the stemmed
> indexes will likely have more word hits.  Still, it's something you
> could do with a bit of hacking.
>
> Also, if you are careful on indexing (that is, create an output file
> with -S prog and then create both your indexes from that file), the
> file numbers should match up.  So, after sorting you could malloc
> an array the size of the result set and then weed out the duplicates
> based on file number.  That avoids the need to read the property file
> for each result's file name and then swish will report an accurate
> total number of hits.
>
> Likely not as fast, but another way would be to sort results by file
> number, then walk the results removing the duplicates.  Then resort.
> That could be done either in swish or in post processing in your
> script.
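(for the record, the post-processing version of that dedup would look
something like this in perl: a rough sketch on made-up sample data that
keeps the better-ranked copy of each file number and then resorts by
rank)

```perl
use strict;
use warnings;

# sample hits from the two indexes; filenum stands in for swish-e's
# internal file number, which lines up across indexes only if both
# were built from the same -S prog output
my @hits = (
    { filenum => 1, rank => 700, index => 'stem.index'   },
    { filenum => 1, rank => 900, index => 'nostem.index' },
    { filenum => 2, rank => 800, index => 'stem.index'   },
);

# keep the better-ranked hit for each file number
my %best;
for my $hit (@hits) {
    my $seen = $best{ $hit->{filenum} };
    $best{ $hit->{filenum} } = $hit
        if !$seen || $hit->{rank} > $seen->{rank};
}

# resort the deduped set by rank descending
my @deduped = sort { $b->{rank} <=> $a->{rank} } values %best;
```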
>
>
>
> --
> Bill Moseley
> moseley@hank.org
>
> Unsubscribe from or help with the swish-e list:
>    http://swish-e.org/Discussion/
>
> Help with Swish-e:
>    http://swish-e.org/current/docs
>    swish-e@sunsite.berkeley.edu
>
>