Re: [swish-e] Swish-e not indexing doc or PDF files

From: Liam Buchanan <Liam.Buchanan(at)not-real.dtrdi.qld.gov.au>
Date: Tue Feb 12 2008 - 23:43:32 GMT
Hi,

I am using spider.pl to crawl. As a test, I have only one PDF on the
entire intranet, and I have tried both the domain name and the IP
address in the hyperlink.
I did some extensive testing yesterday. The strange thing is that if I
run pdftotext or pdftohtml directly on a local file, the output is
generated correctly.
When run through swish-e, however, it seems to have a big problem
opening the PDF. The same PDF can be opened directly from a browser (as
a binary file) and, as stated above, it converts correctly when
pdftotext and pdftohtml are run on it directly in cmd.
Here's the pdftohtml error:

 (523 words)
http://*****.au/dsdweb/v4/apps/web/secure/docs/25.pdf - Using HTML2 parser - Error: Couldn't open file ''http://*****.au/dsdweb/v4/apps/web/secure/docs/25.pdf''
 (no words indexed)
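
The "Couldn't open file" message quotes the URL itself as the file name. That suggests (an inference from the error text only, not something confirmed in this thread) that the converter is being handed the URL rather than a local copy of the document, and pdftotext expects a path to a local file. One way to check is to fetch the PDF to a temporary file first and convert that. A minimal Python sketch; the helper names `is_url`, `fetch_to_tempfile`, `pdftotext_command` and `extract_text` are made up for illustration, and the proxy is assumed to be configured through the usual `http_proxy` environment variable:

```python
import os
import subprocess
import tempfile
import urllib.request
from urllib.parse import urlparse


def is_url(source):
    """Return True if source looks like an http(s) URL rather than a local path."""
    return urlparse(source).scheme in ("http", "https")


def fetch_to_tempfile(url, suffix=".pdf"):
    """Download url to a temporary file and return that file's path.

    urllib honours the http_proxy/https_proxy environment variables,
    so a local proxy is used if one is configured there.
    """
    with urllib.request.urlopen(url) as response:
        data = response.read()
    fd, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return path


def pdftotext_command(pdf_path):
    """Build the pdftotext call; '-' as the output file sends the text to stdout."""
    return ["pdftotext", pdf_path, "-"]


def extract_text(source):
    """Convert a PDF given as a URL or a local path and return its text."""
    path = fetch_to_tempfile(source) if is_url(source) else source
    try:
        result = subprocess.run(
            pdftotext_command(path), capture_output=True, text=True, check=True
        )
        return result.stdout
    finally:
        # Only delete the file if it is the temporary copy we created.
        if path != source:
            os.remove(path)
```

If `extract_text` returns text for the test PDF's URL while the indexing run still fails, the problem is more likely in how the filter is invoked than in the PDF itself.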

Also, I am not sure how to turn on the -T debugging. Can you assist me
with this?
Verbose output is already active.

Thanks.
Liam.

-----Original Message-----
From: users-bounces@lists.swish-e.org
[mailto:users-bounces@lists.swish-e.org] On Behalf Of Peter Karman
Sent: Tuesday, 12 February 2008 1:01 PM
To: Swish-e Users Discussion List
Subject: Re: [swish-e] Swish-e not indexing doc or PDF files



Liam Buchanan wrote on 2/11/08 6:26 PM:
> Hi,
> Hope someone can suggest a solution to this frustrating problem.
> We are running swish-e on our development server that indexes our 
> production intranet server. However the problem lies in the inability 
> for the indexing to process .doc or PDF files. When the search reaches
> a hyperlink that is linked to a PDF or doc file the process halts and
> the error message is produced below (under output). Before running
> swish-e, we connect to our production server via a proxy connection
> first (ntlmaps)

It isn't clear to me how you are aggregating your documents. spider.pl?
Some other crawler?

The FileFilter config can work at odds with the SWISH::Filter stuff in
spider.pl, effectively trying to convert non-text files twice.

Try indexing a single troublesome document. Break down the process:
fetching the doc, feeding it to swish-e, etc. Turn on verbosity and the
-T debugging options.
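
The steps above could be scripted roughly as follows. This is only a sketch: it assumes swish-e's `-S prog`, `-i stdin`, `-c`, `-v` and `-T` switches and that spider.pl, run on its own with the `default` config, writes its output to stdout. The file names `swish.conf` and `spider.out` and the function names are placeholders. Running `swish-e -T help` should list the trace options the installed version accepts.

```python
import subprocess


def spider_command(url, spider="spider.pl"):
    """Step 1: run the spider alone; 'default' selects spider.pl's built-in config."""
    return ["perl", spider, "default", url]


def swish_command(config="swish.conf", verbosity=3, trace=None):
    """Step 2: index whatever arrives on stdin using the prog input method."""
    cmd = ["swish-e", "-S", "prog", "-i", "stdin",
           "-c", config, "-v", str(verbosity)]
    if trace:
        # Value for -T, e.g. "help" to print the available trace options.
        cmd += ["-T", trace]
    return cmd


def run_steps(url, capture="spider.out", trace=None):
    """Run the two steps separately so each one's output can be inspected."""
    # Save the spider output to a file first, so it can be examined by hand.
    with open(capture, "wb") as out:
        subprocess.run(spider_command(url), stdout=out, check=True)
    # Then feed the saved output to swish-e on stdin.
    with open(capture, "rb") as src:
        subprocess.run(swish_command(trace=trace), stdin=src, check=True)
```

Keeping the spider output in a file makes it possible to see whether the PDF was fetched and converted before swish-e ever sees it.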

--
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
_______________________________________________
Users mailing list
Users@lists.swish-e.org
http://lists.swish-e.org/listinfo/users

Received on Tue Feb 12 18:56:58 2008