There might be a possible solution.
I would do this:
step 0 - index in random order (see below)
step 1 - make the usual query (X and Y not Z)
step 2 - make a second query (X and Y not Z), asking for categories
*only*, sorted randomly
- if the number of results is below a certain threshold (which depends on the
size of the collection), then count everything, i.e. every category
- if it is above the threshold, fetch only a sample and count the sample
You only have to make sure that the results are returned in *random* order.
This is probably a little tricky, but you could assign random
strings/numbers during indexing and return the same query results (step
1) in random order (step 2).
Thus you don't need to loop over 100000 results, but only over 1000 or
so, and the numbers should be accurate enough. However, you will not get
absolute counts, only relative ones:
a.b  15%     a -> 25%
b    55%     b -> 55%
c.a  10%     c -> 12%
d     8%     d ->  8%
Of course, this holds only with a certain level of probability, but who
minds that? If the sample is big enough (and random), it resembles the whole
population very closely. And your categories should serve as a hint for
further searching, shouldn't they?
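
A minimal sketch of the counting in step 2, in Python. It assumes the search engine hands back one category string per hit, already in random order; the names `estimate_shares`, `roll_up` and the default threshold of 1000 are made up for illustration and are not part of swish-e. Reading at most `threshold` hits covers both cases: a small result set is counted completely, a large one is only sampled.

```python
from collections import Counter
from itertools import islice


def estimate_shares(hits_in_random_order, threshold=1000):
    """Return {category: share of hits}, reading at most `threshold` hits.

    `hits_in_random_order` is assumed to yield one category string per hit
    (e.g. "a.b"), already shuffled by the search engine.
    """
    sample = list(islice(hits_in_random_order, threshold))
    if not sample:
        return {}
    counts = Counter(sample)
    return {cat: n / len(sample) for cat, n in counts.items()}


def roll_up(shares):
    """Add the share of each dotted category ("a.b") to its top level ("a")."""
    top = Counter()
    for cat, share in shares.items():
        top[cat.split(".", 1)[0]] += share
    return dict(top)
```

If the engine also reports the total number of hits, multiplying each share by that total gives a rough estimate of the absolute count.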
I don't know if it will be fast enough; swish-e might also be hacked to
return results in random order (?).
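
The random ordering from step 0 could look roughly like the sketch below. It assumes every document gets an independent random number once, at indexing time, and that hits are later sorted by that number. The helpers `assign_random_keys` and `in_random_order` are hypothetical, and the Python `sorted` call only stands in for a sort the engine would have to do on a stored value; whether swish-e can do that fast enough is not verified here.

```python
import random


def assign_random_keys(doc_ids, seed=None):
    """Indexing time: give every document an independent random sort key."""
    rng = random.Random(seed)
    return {doc_id: rng.random() for doc_id in doc_ids}


def in_random_order(hit_ids, keys):
    """Query time: return the hits of a query sorted by their stored key.

    Because the keys do not depend on the query, the first N hits of this
    ordering behave like a simple random sample of all hits.
    """
    return sorted(hit_ids, key=keys.__getitem__)
```

Since the keys are fixed, the same query always yields the same sample, so the estimated percentages stay stable between page loads.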
Net Virtual Mailing Lists wrote:
>>If I understand it, Greg would like to have something like a browsable index
>>of categories (at least, something that summarizes categories).
>>I was trying to do something similar, look for example here (testing)
>>it is an external script that simply counts the number of occurrences
>>for later browsing/searching.
>>I think this information can be collected from the swish-e index too,
>>something like dumping metadata out of the index and then counting it.
>>However, we would need the ability to dump only certain parts of the index.
>>Does that sound reasonable?
> I think you understand what I am after here. :)
> Except in the example you gave it would be:
> a 45
> a.b 20
> a.c 25
> . every upper level category's count is a sum of its sibling counts.
> For a bit of theoretical thought on this:
> Imagine if I indexed 1 million files which fall into 200 categories. Now
> imagine if a search result across all 1 million documents returns 100,000
> of them. For the main page I want to display, based on that result,
> simply a count of how many documents fall into each category. This
> would require having a script iterate through a loop 100,000 times, when
> it seems as if this could be handled *very* efficiently inside a search
> engine, especially with the way Swish-E seems to have been designed (e.g.
> property values). It strikes me that Swish-E is spending extra work to
> give me all these results and then I'm spending extra work in an external
> script to process the results. Theoretically speaking am I completely
> wrong here? If not, how hard would it be to do this and could it be
> added to a TODO list somewhere? If I am wrong, sorry for beating this
> dead horse.
> As for the results page I would add to the search query whichever
> category has currently been selected, reducing the number of returned
> results to a much smaller number.
> I have written a script to do this and while the performance is adequate,
> it is no better than querying against Postgres directly. I pick up some
> performance when executing a query inside a specific category, but I've
> not seen any improvement in the "summary" query when compared against
> I am sorry, I went to the URL you have listed above, but I just can't
> tell what it is I am looking at (probably a language thing). :)
> - Greg
Received on Mon Jul 11 23:55:23 2005