On Thu, Oct 28, 2004 at 10:13:31AM -0400, Antonio Barrera wrote:
> Would this apply similarly to using xpdf to parse PDF docs?
> IndexContents HTML* .htm .html .shtml .php
> IndexContents TXT* .txt .log .text .pdf
> IndexContents XML* .xml
> StoreDescription TXT* 10000
> StoreDescription HTML* <body>
Maybe. Depends on how the PDF files are indexed. If you are using
spider.pl (with SWISH::Filter) then the document type is passed
directly to swish:
$ spider.pl default http://localhost/apache/test.pdf 2>/dev/null | head -5
The Document-Type: header in the spider's output tells swish what type of file is being indexed:
$ spider.pl default http://localhost/apache/test.pdf 2>/dev/null | swish-e -v9 -i stdin -S prog
Indexing Data Source: "External-Program"
http://localhost/apache/test.pdf - Using HTML2 parser - (2301 words)
Note that it says "Using HTML2 parser". Now if you just index a file
without telling swish the parser type, it says:
$ swish-e -i 1.html -v9
Indexing Data Source: "File-System"
Checking file "1.html"...
1.html - Using DEFAULT (HTML2) parser - (12 words)
So it's saying "DEFAULT" there.
If you are not using spider.pl or some -S prog program that passes in
the Document-Type: header then, yes, you would need to use
DefaultContents or IndexContents to set the content type.
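
As a rough illustration of what "passing in the Document-Type: header" looks like, here is a minimal sketch of a -S prog style program in Python. It assumes the PDF text has already been extracted to plain-text files by some other tool; the function name format_prog_record, the script name, and the choice of TXT* as the type are hypothetical choices for illustration, not anything spider.pl itself does.

```python
import sys


def format_prog_record(path_name: str, content: bytes, doc_type: str) -> bytes:
    """Build one document record: headers, a blank line, then the content."""
    headers = (
        f"Path-Name: {path_name}\n"
        # Content-Length is the number of bytes that follow the blank line.
        f"Content-Length: {len(content)}\n"
        # Document-Type names the parser for this document (assumed TXT* here).
        f"Document-Type: {doc_type}\n"
        "\n"
    )
    return headers.encode("utf-8") + content


def main(paths):
    # Emit one record per already-extracted text file named on the command line.
    for path in paths:
        with open(path, "rb") as handle:
            text = handle.read()
        sys.stdout.buffer.write(format_prog_record(path, text, "TXT*"))


if __name__ == "__main__":
    main(sys.argv[1:])
```

Saved as, say, emit_txt.py (a made-up name), it could be piped in the same way as the spider above:

$ python3 emit_txt.py extracted.txt | swish-e -v9 -i stdin -S prog

With the header present, the -v9 output should name the parser without the "DEFAULT" prefix.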
I guess the reasoning is that StoreDescription works differently for
different types of documents, so it needs to be told what the document
type is.
Here's my comment from many years ago:
Received on Thu Oct 28 07:29:38 2004