Quick google Mr. December 18??? ;)
What i **really** want to know is if you googled caching or googled
yourself.
Having access to the swishdbfile property will take me a long way
by allowing me to push the stemmed results to the back. Then I can do
something like ask swish for the hit counts from each index, and
know that if I am sorting the combined set with stems last, I can limit
with the knowledge that my stemmed results won't even start until result
number 3000 or whatever.
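
To make that concrete, here is a rough sketch of the "stems last" merge. It
is in Python rather than Perl, and it assumes the hits have already been
read from swish-e into memory along with their swishdbfile value. The Hit
type, the helper names and the index file names (plain.index, stem.index)
are all made up for illustration.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    rank: int
    path: str
    dbfile: str  # value of the swishdbfile property for this hit


def merge_stems_last(hits, stemmed_index):
    """Put hits from the stemmed index last; sort each group by descending rank."""
    # False sorts before True, so unstemmed hits come first.
    return sorted(hits, key=lambda h: (h.dbfile == stemmed_index, -h.rank))


def first_stemmed_position(hit_counts, stemmed_index):
    """Zero-based offset where stemmed hits begin: the hit total of every other index."""
    return sum(n for name, n in hit_counts.items() if name != stemmed_index)


# Hypothetical sample data; real hits would come from the two swish-e indexes.
hits = [
    Hit(900, "/xmldocs/a.xml", "stem.index"),
    Hit(700, "/xmldocs/b.xml", "plain.index"),
    Hit(1000, "/xmldocs/c.xml", "plain.index"),
]
merged = merge_stems_last(hits, "stem.index")
```

With per-index hit counts in hand, first_stemmed_position() says how far
into the combined set a page has to reach before any stemmed hit can show
up, so a limit below that number never touches the stemmed results.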
I used to store my result sets as blobs in mysql, which didn't scale so
well, and at some point went back to relying on swish-e's internals as they
were so much easier to work with. At about the same time I went from
returning a limited set of props as you mentioned, which I in turn passed
to mysql for the extended info, to adding my properties to the index.
But I haven't looked at moving to in-memory or disk caching.
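
For reference, the disk caching Bill describes below (Storable or
Cache::FileCache in Perl) might look roughly like this. It is a sketch in
Python using pickle, not his actual code: the cache directory, the function
names and the run_search callback are assumptions, and cache expiry (which
something like Cache::FileCache would handle) is left out.

```python
import hashlib
import os
import pickle
import tempfile

# Hypothetical cache location; any writable directory would do.
CACHE_DIR = os.path.join(tempfile.gettempdir(), "search_cache")


def cache_path(query):
    """Map a query string to a cache file name via a hash of the query."""
    digest = hashlib.sha1(query.encode("utf-8")).hexdigest()
    return os.path.join(CACHE_DIR, digest + ".pkl")


def get_results(query, run_search, max_hits=3000):
    """Return the cached result set for query, running the search on a miss."""
    path = cache_path(query)
    try:
        with open(path, "rb") as fh:
            return pickle.load(fh)
    except (OSError, pickle.UnpicklingError, EOFError):
        pass  # no usable cache entry; fall through and search

    # run_search stands in for whatever actually queries swish-e.
    results = run_search(query)[:max_hits]

    os.makedirs(CACHE_DIR, exist_ok=True)
    tmp = "%s.%d.tmp" % (path, os.getpid())
    with open(tmp, "wb") as fh:
        pickle.dump(results, fh)
    os.replace(tmp, path)  # atomic rename so readers never see a partial file
    return results


def page(results, page_number, page_size=10):
    """Slice one page (1-based) out of a full result set."""
    start = (page_number - 1) * page_size
    return results[start:start + page_size]
```

A "next page" request would then call get_results() with the same query and
slice with page(), without running the search again.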
My last question, Bill, and as always thanks for the time: I know that in
the case of html docs, swish assigns value to the importance of elements
(title, body, etc). If I use xml elements with the same values, does it
consider them the same? I noticed that the next version has something
called MetaNamesRank, which seems like it will allow me to do this via
config.
Brad
------------------------------------------------------------
Brad Miele
Technology Director
IPNStock
(866) 476-7862 x902
bmiele@ipnstock.com
Imagination is the one weapon in the war against reality.
-- Jules de Gaultier
On Thu, 4 Nov 2004, Bill Moseley wrote:
> On Wed, Nov 03, 2004 at 02:59:51PM -0800, Brad Miele wrote:
> > along these lines, and since you were the sucker, err kind soul who
> > responded first, do you know if there is a way to force a meta for every
> > record based on config?
> >
> > basically, if i am indexing /xmldocs, once with stem and once without, i
> > would like to set a sort value of 0 for the non-stemmed and 1 for the
> > stemmed, so when i sorted the results
>
> You want a way to identify which index a given result comes from? You
> don't mean metanames, you mean a property name -- 'swishdbfile' is
> the name.
>
> Here's some other ideas:
>
> You might set a limit of how many results you will process. Who is
> going to page through 10,000 results? So that can limit the
> processing requirements.
>
> You can then read in all your results and do your processing then use
> Perl's Storable to dump the result set to disk. A quick google found
> one discussion on caching:
>
> http://mathforum.org/epigone/modperl/dwimpblelkox
>
> I've done this before with Cache::FileCache (or something like that
> that handles the cache management) and then on following requests (for
> like "next page") use the cached data. It was faster than having
> swish run the same query again.
>
> Managing the cache from within swish would likely be faster, but quite
> a bit more work. But, it would be nice to be able to store the
> "result set" that swish maintains internally and then reload that.
> That would avoid having to read all the properties that you are
> interested in (important if you have a lot of data in your props).
>
> When I did that in the past I was only using swish to return the file
> name and two small properties -- no description. So, swish-e returned
> all hits (well, up to some pre-defined limit of a few thousand), then
> I would do a bit of processing in perl, display the first result set,
> fork and close stdout (to allow the web to finish the connection) and
> then I'd write the result set to disk.
>
> Inside swish you can probably do quite a bit. First, the index file
> is known at search time, so you could without much trouble alter the
> ranking based on the index file. It would be hard because the stemmed
> indexes will likely have more word hits. Still, it's something you
> could do with a bit of hacking.
>
> Also, if you are careful on indexing (that is, create an output file
> with -S prog and then create both your indexes from that file) then
> the file numbers should match up. So, after sorting you could malloc
> an array the size of the result set and then weed out the duplicates
> based on file number. That avoids the need to read the property file
> for each result's file name and then swish will report an accurate
> total number of hits.
>
> Likely not as fast, but another way would be to sort results by file
> number, then walk the results removing the duplicates. Then re-sort.
> That could be done either in swish or in post processing in your
> script.
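
The post-processing variant (sort by file number, drop duplicates, re-sort)
could be sketched as below. This is Python for illustration only; the
(filenum, rank, path) tuple layout is an assumption, and keeping the
highest-ranked copy of each file is one possible choice, not something the
thread specifies.

```python
from itertools import groupby
from operator import itemgetter


def dedupe_by_filenum(hits):
    """Remove duplicate files from a combined hit list.

    hits is a list of (filenum, rank, path) tuples gathered from both indexes.
    """
    # 1. Sort by file number, best rank first within each file.
    by_file = sorted(hits, key=lambda h: (h[0], -h[1]))
    # 2. Walk the sorted list, keeping only the first hit of each file number.
    unique = [next(group) for _, group in groupby(by_file, key=itemgetter(0))]
    # 3. Re-sort the survivors by descending rank for display.
    return sorted(unique, key=lambda h: -h[1])


# Hypothetical sample: files 1 and 2 were matched in both indexes.
hits = [(1, 1000, "a"), (2, 800, "b"), (1, 900, "a"), (3, 950, "c"), (2, 850, "b")]
deduped = dedupe_by_filenum(hits)
```

The length of the deduplicated list is then the accurate total hit count,
which only holds if file numbers really do match across the two indexes.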
>
>
>
> --
> Bill Moseley
> moseley@hank.org
Received on Thu Nov 4 10:54:13 2004