Re: pdftotext - erroring out

From: intervolved none <intervolved(at)not-real.yahoo.com>
Date: Thu Oct 24 2002 - 20:33:54 GMT

>Well then you are not trying the right file. ;)

>That error message is from either pdftotext or pdftoinfo. I've had
>similar problems and it was a matter of finding a way to show me which
>file was the problem, as I explained.

I have tested swish-e by indexing the file using both -fs and -http. The PDF indexes fine with -fs. With -http it does not, and I get the error message that the PDF file is damaged. Both methods index the same file.

>That error message is from either pdftotext or pdftoinfo. I've had
>similar problems and it was a matter of finding a way to show me which
>file was the problem, as I explained.

I have created a very simple page that includes the file I am indexing (I have attached both the PDF and the text output file). I would think that if the problem were with pdftotext, I would see it with both types of indexing (http and fs).

>Start isolating your problem. Narrow it down to one file. Then divide
>the indexing steps up and I'm sure you will find the problem. Edit the
>pdf conversion script to warn() the file name before calling
>pdftoinfo and pdftotext. Make sure the source pdf is exactly like the
>pdf file that pdftotext is seeing. Try all the normal debugging steps.

I believe I did part of this. I did narrow it down to one file. I did not edit the pdf conversion script. I assume that applies only if I were using the pdftotext.pl program? I am using the win32 exe to do the conversion.

Bill Moseley <moseley@hank.org> wrote:
On Thu, 24 Oct 2002, intervolved none wrote:

> 
> Thanks Bill for the response. It is all PDF's that it runs against. 
> I have downloaded PDF's from the web, tried to index them and all of
> them fail. I have run the program pdftotext.exe at the command line
> and it converts the files fine (I have not brought it up in a hex
> editor to look for unprintables...). What I mean by fine is that I
> see the text that was in the PDF file and there are no noticeable
> problems.

Well then you are not trying the right file. ;)

That error message is from either pdftotext or pdftoinfo. I've had
similar problems and it was a matter of finding a way to show me which
file was the problem, as I explained.

Start isolating your problem. Narrow it down to one file. Then divide
the indexing steps up and I'm sure you will find the problem. Edit the
pdf conversion script to warn() the file name before calling
pdftoinfo and pdftotext. Make sure the source pdf is exactly like the
pdf file that pdftotext is seeing. Try all the normal debugging steps.
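
The logging step suggested above could look something like the following. This is only a sketch in Python, not the Perl conversion script itself: the function name `convert_with_logging` and the `converter` parameter are made up for illustration, and it assumes the converter is called in the usual `pdftotext file.pdf -` form, writing its text to standard output.

```python
import subprocess
import sys


def convert_with_logging(pdf_path, converter=("pdftotext",)):
    """Log the file name, run the converter, return (exit_code, output_bytes).

    `converter` is the command prefix (hypothetical parameter); the PDF
    path and "-" (write to stdout) are appended to it.
    """
    # Report the file name *before* conversion, so a "damaged PDF" error
    # in the log can be matched to the file that triggered it.
    print(f"converting: {pdf_path}", file=sys.stderr)

    result = subprocess.run([*converter, pdf_path, "-"], capture_output=True)
    if result.returncode != 0:
        # Echo the converter's own error output right after the file name.
        sys.stderr.write(result.stderr.decode(errors="replace"))
    return result.returncode, result.stdout
```

With this in place, the last "converting:" line printed before an error names the file that needs a closer look.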

Are you on Windows or OS X? Maybe you are seeing some line ending
conversions.
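
A quick way to test for line ending conversions, sketched in Python. It assumes you can save a copy of the bytes that the converter actually receives (for example, the temporary file written after the HTTP fetch); the function name `compare_copies` is hypothetical.

```python
def compare_copies(original: bytes, seen: bytes) -> str:
    """Classify how the bytes handed to the converter differ from the source."""
    if original == seen:
        return "identical"
    # A text-mode write on Windows turns every LF into CRLF.
    if original.replace(b"\n", b"\r\n") == seen:
        return "LF expanded to CRLF"
    # A text-mode read on Windows turns every CRLF into LF.
    if original.replace(b"\r\n", b"\n") == seen:
        return "CRLF collapsed to LF"
    # The copy stops early, e.g. an incomplete download or write.
    if len(seen) < len(original) and original.startswith(seen):
        return "truncated"
    return "different"
```

Read both files with `open(path, "rb")` so that the comparison itself does not translate line endings. Anything other than "identical" means pdftotext is not seeing the same file that converts cleanly at the command line.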

-- 
Bill Moseley moseley@hank.org

Attachment: swishe.txt

     Swishe

Non-text elements of this multipart message have been deleted to make the message conform with the policies of this list.

Received on Thu Oct 24 20:38:32 2002