Re: [swish-e] Swish-e not indexing doc or PDF files

From: Liam Buchanan <Liam.Buchanan(at)not-real.dtrdi.qld.gov.au>
Date: Thu Feb 28 2008 - 05:10:24 GMT
I am using xpdf, but I don't think that is the issue.

I tried a test with an HTML page containing a direct file link to the
PDF document, and the spider was able to index it correctly and write to
an HTML document.

When I instead specify a URL as the link to the PDF, the message is
'can't open file', even though the file is accessible through a browser
via the same URL.

One thing I did notice: when I put the IP address in the URL link
instead of the domain, the MIME type shown in the cmd output was '???'
instead of the usual 'application/pdf'.

It seems a shame to replace swish-e with another product when this is
the last stumbling block we have, but it seems we have no alternative.
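
One way to narrow this down is to look at what the server actually
returns for each form of the URL. The following is only a sketch in
Python, not part of swish-e or spider.pl; the function names
diagnose_pdf_response and check_url are made up for illustration. It
checks three things that show up in the spider output: the Content-Type,
whether Content-Length matches the bytes received, and whether the body
looks like a PDF at all.

```python
from urllib.request import urlopen


def diagnose_pdf_response(headers, body):
    """Return a list of problems found in an HTTP response that should be a PDF.

    `headers` is a mapping of header names to values; `body` is the raw bytes.
    Both function names in this sketch are hypothetical helpers.
    """
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    problems = []

    # Ignore any parameters after ";" when comparing the media type.
    content_type = lowered.get("content-type", "")
    if content_type.split(";")[0].strip().lower() != "application/pdf":
        problems.append("unexpected Content-Type: %r" % content_type)

    # A mismatch here suggests the download was truncated or altered.
    declared = lowered.get("content-length")
    if declared is not None and int(declared) != len(body):
        problems.append(
            "Content-Length says %s bytes but %d were received"
            % (declared, len(body))
        )

    # A PDF normally starts with "%PDF-"; an HTML error or login page does not.
    if not body.startswith(b"%PDF-"):
        problems.append("body does not start with %PDF-")

    return problems


def check_url(url):
    """Fetch `url` and run the checks above (needs network access)."""
    with urlopen(url) as response:
        return diagnose_pdf_response(dict(response.headers.items()), response.read())
```

Calling check_url() once with the hostname form of the link and once
with the IP form, then comparing the two lists, would show whether the
server answers differently in the two cases. An empty list only means
none of these three checks failed; it does not prove that xpdf can read
the file.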

Can someone suggest another program we could use in a Windows
environment?

Thanks again

 

-----Original Message-----
From: users-bounces@lists.swish-e.org
[mailto:users-bounces@lists.swish-e.org] On Behalf Of William M Conlon
Sent: Wednesday, 27 February 2008 5:21 PM
To: Swish-e Users Discussion List
Subject: Re: [swish-e] Swish-e not indexing doc or PDF files

Are you using xPDF?  Adobe keeps changing the PDF file format, and as I
recall, xPDF will not read the latest versions of Adobe PDF documents.
We save all of our PDFs as version 4 or 5, I think.
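
A quick way to see which PDF version a file declares is to read its
header, which normally looks like %PDF-1.4. The sketch below is in
Python and is an illustration only; pdf_header_version and
read_pdf_version are hypothetical helper names, not xPDF or swish-e
functions. If "version 4 or 5" means Acrobat 4 or 5 compatibility, those
correspond to PDF 1.3 and 1.4.

```python
import re


def pdf_header_version(data):
    """Return (major, minor) from a PDF header such as b"%PDF-1.4", or None."""
    # Only the start of the file is searched; 1024 bytes is an assumed,
    # generous window for this sketch.
    match = re.search(rb"%PDF-(\d+)\.(\d+)", data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def read_pdf_version(path):
    """Read the start of the file at `path` and report its declared version."""
    with open(path, "rb") as handle:
        return pdf_header_version(handle.read(1024))
```

Running read_pdf_version() on a local copy of the file gives the declared
version as a tuple, or None if no header is found in the first 1024
bytes. The header is only a first check; it says nothing about whether
the rest of the file is intact.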

Bill


On Feb 26, 2008, at 10:17 PM, Liam Buchanan wrote:

> Hi,
> I just tried indexing a PDF linked by URL from a .cfm page on the
> local machine and got these errors:
>
> Accept-Ranges: bytes
> ETag: "0315492fcfec21:58ec"
> Server: Microsoft-IIS/5.0
> Content-Length: 2660012
> Content-Type: application/pdf
> Last-Modified: Thu, 10 Apr 2003 01:00:26 GMT
> Client-Date: Wed, 27 Feb 2008 06:14:24 GMT
> Client-Peer: 127.0.0.1:5865
> Client-Response-Num: 1
> X-Powered-By: ASP.NET
>
> ^^^^^^^^^^^^^^^ END HEADERS ^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> >> +Fetched 1 Cnt: 2 GET http://172.16.100.241/dsdweb/v3/guis/templates/content/Errmsg.pdf  200 OK application/pdf 2660012 parent:http://172.16.100.241/dsdweb/v3/guis/templates/content/testpage.cfm depth:1
> http://172.16.100.241/dsdweb/v3/guis/templates/content/testpage.cfm - Using HTML2 parser -  (54 words)
> http://172.16.100.241/dsdweb/v3/guis/templates/content/Errmsg.pdf - Using HTML2 parser - Error (0): PDF file is damaged - attempting to reconstruct xref table...
> Error: Top-level pages object is wrong type (null)
> Error: Couldn't read page catalog
>  (no words indexed)
>
> -------------
>
> Can anyone suggest what the issue is here?
>
> Thanks !!!
>
>
>
> -----Original Message-----
> From: users-bounces@lists.swish-e.org
> [mailto:users-bounces@lists.swish-e.org] On Behalf Of Peter Karman
> Sent: Saturday, 23 February 2008 3:45 AM
> To: Swish-e Users Discussion List
> Subject: Re: [swish-e] Swish-e not indexing doc or PDF files
>
>
>
> On 02/20/2008 07:50 PM, Liam Buchanan wrote:
>> Hi,
>> Thanks for the information.
>> I tried to do a trace but it didn't come up with anything unusual.
>>
>> Below is my spider.pl config file.
>> Please let me know if there is anything in there I am missing or
>> should be taken out. The proxy reference needs to be in there for it
>> to work.
>
>
> I guess I'm not following you. The example you gave works? What is a 
> 'trace'?
>
> --
> Peter Karman  .  peter(at)not-real.peknet.com  .  http://peknet.com/
>
> _______________________________________________
> Users mailing list
> Users@lists.swish-e.org
> http://lists.swish-e.org/listinfo/users
>
> ----------------------------------------------------------------------------
> Unless stated otherwise, this email, together with any attachments, is
> intended for the named recipient(s) only and may contain privileged
> and confidential information. If received in error, you are asked to
> inform the sender as quickly as possible and delete this email and any
> copies of this from your computer system network.
>
> If not an intended recipient of this email, you must not copy,
> distribute or take any action(s) that relies on it; any form of
> disclosure, modification, distribution and/or publication of this
> email is also prohibited.
>
> Unless stated otherwise, this email represents only the views of the
> sender and not the views of the Queensland Government.
> ----------------------------------------------------------------------------

Received on Thu Feb 28 00:13:02 2008