You don't say how you plan to "extract the text" of your potential
document, or how you will "run a swish-format query" on the text.
I wouldn't expect it to be very efficient, but you could just index
each doc with swish-e and then search that temp index. swish-e is
just as fast as anything else at "extracting text" and running the
query. Then you could simply delete the index (or repeat for each new
doc, effectively reusing the same tmp index name).
Example Perl off the top of my head:

    my $query   = 'foo bar';
    my %include = ();
    for my $doc (@listofdocs) {
        indexdoc( $doc );
        if ( searchdoc( $query ) ) {
            $include{$doc}++;
        }
    }

where indexdoc() and searchdoc() are functions that create your tmp
index and then search it. You might define a special index name to use
in your code, then remove it at the end.
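
A rough sketch of what those two functions might look like, untested
against a real install. It assumes the swish-e binary is on your PATH,
that the default search output includes a "# Number of hits:" header
line, and that the index leaves a companion ".prop" file behind. The
temp index name and the has_hits() and cleanup() helpers are made up
for illustration.

```perl
use strict;
use warnings;

# Hypothetical temp index name, reused for every document.
my $TMP_INDEX = "/tmp/swish-filter-$$.index";

# Build a one-document index.
# -i: file to index, -f: index file to write, -v 0: quiet
sub indexdoc {
    my ($doc) = @_;
    system( 'swish-e', '-i', $doc, '-f', $TMP_INDEX, '-v', '0' ) == 0
        or die "swish-e indexing failed for $doc\n";
    return 1;
}

# Decide from the search output whether anything matched.
# Assumes a header line of the form "# Number of hits: N".
sub has_hits {
    my ($output) = @_;
    return $output =~ /^# Number of hits:\s*[1-9]\d*/m ? 1 : 0;
}

# Run the query against the temp index and report true on a match.
# -w: search words, -f: index file to read
sub searchdoc {
    my ($query) = @_;
    open( my $fh, '-|', 'swish-e', '-w', $query, '-f', $TMP_INDEX )
        or die "cannot run swish-e: $!\n";
    my $output = do { local $/; <$fh> };
    close $fh;
    return has_hits( defined $output ? $output : '' );
}

# Remove the temp index and its assumed companion property file.
sub cleanup {
    unlink $TMP_INDEX, "$TMP_INDEX.prop";
}
```

Call cleanup() once after the loop. Since each indexdoc() call writes
to the same index name, the previous document's index is simply
replaced on every pass.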
Masoud Pirnazar wrote on 11/17/04 9:56 PM:
> I have used Swish to index and search document collections, and now want to
> "filter" documents before indexing using the same query syntax, i.e.
>
> Given a document, I will extract its text and want to run a swish-format
> query on the text to see if it matches the query criteria; if it does, I
> will add it to my collection.
>
> The simplest method is to add everything to a collection and do a swish
> search on the collection, but I'm looking for a more efficient method,
> especially if the hit percentage is small.
>
> Can anyone suggest anything?
> I looked at the parse_swish_query and tokenize_query_string functions, but
> it gets too complicated quickly.
>
> Thanks in advance for any ideas and comments.
--
Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
Received on Wed Nov 17 20:52:30 2004