> Humm... A search engine independent solution! That's an
> interesting idea. So you actually keep a 2nd file (.LST)
> for *each* pdf you include in the index?
> Of course this forces one to double the required disk space and
> maintain the "DB" (the .LST files), but the flexibility
> it provides is definitely an asset.
If you have thousands of documents, a DB could be an option.
Our DBs often hold only about a thousand documents (with a
total of 50,000 A4 pages of full text).
> Have you considered using an off-the-shelf database (e.g. MySQL)
> instead of the .LST files? I'm not sure it would be a good idea,
> but the PDF files I will be indexing are huge (hundreds of MB
> each) and I am concerned with the access time of doing 2 searches
> for each user query (first in swish and then for the
> LST lookup). Any thoughts?
I suggest a two-step approach. First, the user performs a
Swish-E search, which returns a list of matching documents
(response time is usually under a second).
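
A minimal sketch of this first step, in Python. It assumes the
swish-e binary is on the PATH, that the index file is called
index.swish-e, and that results come back in the default line
format (rank, path, quoted title, size). The function names are
hypothetical, not part of Swish-E.

```python
import re
import subprocess

# Assumed default result line: rank path "title" size
RESULT_LINE = re.compile(r'^(\d+) (.+) "(.*)" (\d+)$')


def parse_swish_output(text):
    """Turn swish-e output into a list of (rank, path, title) tuples."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        if line == ".":
            # A single dot marks the end of the result list.
            break
        match = RESULT_LINE.match(line)
        if match:
            # Header lines ("# ...") and "err: ..." lines do not match.
            hits.append((int(match.group(1)), match.group(2), match.group(3)))
    return hits


def search(query, index="index.swish-e"):
    """Run swish-e for one query and return the parsed hits."""
    proc = subprocess.run(
        ["swish-e", "-w", query, "-f", index],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_swish_output(proc.stdout)
```

Only the paths in the hit list are needed at this point; the LST
lookup is deferred until the user actually opens a document.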
Next, if the user clicks to open a PDF file, you do the
LST lookup and return the PDF and the generated
"pseudo-xml" file. Assuming you have:
1) PDF files saved as "web optimised" (linearised);
2) a browser set up correctly for Acrobat;
3) a server that supports "byte ranges";
then this second step is also relatively fast, since
even large PDF files are served "page by page" (in
our case often only 10 KBytes per page).
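
To illustrate the "byte ranges" requirement: the viewer asks for a
slice of the file with an HTTP Range header, and the server answers
with just those bytes. Below is a rough Python sketch of how a
server might resolve such a header. It handles only a single range,
and the function names are hypothetical; a real web server already
does this for static files.

```python
def parse_byte_range(header, size):
    """Resolve a single 'bytes=start-end' header against a file size.

    Returns inclusive (start, end) offsets, or None if the header is
    missing, malformed, multi-range, or cannot be satisfied.
    """
    if not header or not header.startswith("bytes=") or size <= 0:
        return None
    spec = header[len("bytes="):].strip()
    if "," in spec or "-" not in spec:
        return None
    first, _, last = spec.partition("-")
    first, last = first.strip(), last.strip()
    try:
        if first == "":
            # Suffix form 'bytes=-N' means the last N bytes.
            length = int(last)
            if length <= 0:
                return None
            return max(size - length, 0), size - 1
        start = int(first)
        end = int(last) if last else size - 1
    except ValueError:
        return None
    if start < 0 or start >= size or end < start:
        return None
    return start, min(end, size - 1)


def content_range(start, end, size):
    """Build the Content-Range value for a 206 Partial Content reply."""
    return f"bytes {start}-{end}/{size}"


def read_range(path, start, end):
    """Read the inclusive byte range [start, end] from a file."""
    with open(path, "rb") as fh:
        fh.seek(start)
        return fh.read(end - start + 1)
```

With a linearised PDF, the viewer can request the bytes for one
page at a time, which is why a 10 KByte page opens quickly even
when the whole file is hundreds of MB.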
Herman Knoops
KnoMan.com
Received on Thu Jun 29 03:18:57 2006