Re: Indexing other document types with SWISH::Filter

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Mon Aug 21 2006 - 01:33:28 GMT

andy rosbrook scribbled on 8/20/06 3:25 PM:
> Hi all, just a quick question. I've been reading the docs on SWISH::Filter and spider.pl, and I tried the following command to test indexing a PDF document:
> 
> swish-filter-test foo.pdf foo.txt
> 
> i get the following result:
> 
> Document foo.pdf was  filtered.
>    Document:     foo.pdf  (foo.pdf)
>    Content-Type: text/html
>    Parser type:  HTML*
> 
>    >Filter used: SWISH::Filters::Pdf2HTML=HASH(0x9dd70f0) ( application/pdf -> text/html )
> ** /usr/local/bin/swish-filter-test:
>   Failed to open 'foo.txt': No such file or directory
> 
> What's the problem here? I presume the document was filtered OK?
> 

Check the usage for swish-filter-test. What you asked it to do was filter two
documents: foo.pdf and foo.txt. I think you were expecting it to write the
converted contents of an input (foo.pdf) to an output (foo.txt), but that's not
what swish-filter-test does. Every filename argument is treated as an input,
which is why it complained that it could not open foo.txt.

See perldoc swish-filter-test for the details.

You might have meant:

  $ swish-filter-test -content foo.pdf > foo.txt

which prints the converted/filtered content of foo.pdf to stdout, where the
shell redirects it into foo.txt.
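
To illustrate why a second filename is read as another input while a redirection is not, here is a minimal shell sketch. Note that show_inputs is a hypothetical stand-in written for this example, not part of Swish-e; it only mimics a tool that treats each argument as an input file.

```shell
#!/bin/sh
# Hypothetical stand-in (NOT part of Swish-e): like swish-filter-test,
# it treats every positional argument as an input file.
show_inputs() {
    for f in "$@"; do
        printf 'input: %s\n' "$f"
    done
}

# Two arguments: the tool sees two inputs.
show_inputs foo.pdf foo.txt

# One argument plus a redirection: the shell opens the output file itself,
# so the tool sees only foo.pdf.
out=$(mktemp)
show_inputs foo.pdf > "$out"
cat "$out"
rm -f "$out"
```

The redirection is handled by the shell before the command runs, so the output filename never appears in the command's argument list.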

> On another note, does anything need to be included in the spider config to get SWISH::Filter working for PDF documents, or is it automatic?
>

I believe it is automatic, as long as swish-filter-test works (which in your
case it appears to). But read

  perldoc spider.pl

to confirm.

-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Sun Aug 20 18:33:33 2006