Re: DirTree works in pipe but not config file on PDF

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Thu Jul 06 2006 - 13:13:18 GMT

Here's my test. See if you can reproduce it exactly:

[karpet@cartermac:~/tmp/s]$ swish-e -c conf -S prog -v3 -W0
Parsing config file 'conf'
Indexing Data Source: "External-Program"
Indexing "/usr/local/lib/swish-e/DirTree.pl"
External Program found: /usr/local/lib/swish-e/DirTree.pl
Indexing ./test.pdf
/test.pdf - Using HTML2 parser -  (38 words)

Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 26 words alphabetically
Writing header ...
Writing index entries ...
   Writing word text: Complete
   Writing word hash: Complete
   Writing word data: Complete
26 unique words indexed.
4 properties sorted.
1 file indexed.  583 total bytes.  38 total words.
Elapsed time: 00:00:03 CPU time: 00:00:00
Indexing done!
[karpet@cartermac:~/tmp/s]$ cat conf
#
IndexDir /usr/local/lib/swish-e/DirTree.pl

SwishProgParameters test.pdf

# end of the config file



Since you say it works fine when you run DirTree.pl directly on the
files, I don't suspect a bad .pdf file. I'm not sure what's going on
with your setup -- maybe try the full path to the DirTree.pl command?
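
Two quick checks might help narrow this down: whether the file really
starts with a PDF signature, and where the external converter programs
resolve on the PATH of the shell that launches swish-e. The sketch below
is only a rough aid, not part of swish-e or DirTree.pl. The function
names looks_like_pdf and report_tools are hypothetical, and the program
names in the usage note are my assumption about what the filters call on
a typical install.

```shell
#!/bin/sh
# Diagnostic sketch. Helper names are hypothetical, not part of swish-e.

# looks_like_pdf FILE
# Succeeds when FILE begins with the "%PDF-" signature.
looks_like_pdf() {
    # Read the first five bytes; drop NULs so binary files do not
    # trigger shell warnings in the command substitution.
    sig=$(head -c 5 "$1" 2>/dev/null | tr -d '\000')
    [ "$sig" = "%PDF-" ]
}

# report_tools NAME...
# Prints where each program resolves on the current PATH, or a
# "not found" line. Returns non-zero if any program is missing.
report_tools() {
    missing=0
    for tool in "$@"; do
        if where=$(command -v "$tool" 2>/dev/null); then
            printf '%s: %s\n' "$tool" "$where"
        else
            printf '%s: not found on PATH\n' "$tool"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, after sourcing the functions in the same shell you use to
run swish-e (converter names here are assumptions, adjust to your
filters):

    looks_like_pdf /home/ghofman/tmp10/swish_text.pdf && echo "has PDF signature"
    report_tools pdftotext pdfinfo catdoc

If the results differ between the shell where the pipe works and the
one where the config-file run fails, that points at the environment
rather than the files.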





Gertjan Hofman scribbled on 7/5/06 6:05 PM:
> Peter,
> 
> Took me a day to get back to this. The problem persists
> - see below. The path/file is correct and yet it
> claims it's not a PDF.
> 
> I wonder if I am just getting an incorrect error and
> being misled. I have 5 test files in
> /home/ghofman/tmp10: a .doc, .txt, .ppt, .pdf and
> .rtf. When I run DirTree directly and pipe it into swish-e,
> it parses all files correctly. When I use the config
> file, only the .txt and .rtf result in words going to
> the index file. See the second run below. It's unable
> to parse the ppt, doc and pdf. Am I just having a path
> problem or something like that? How do I know where
> DirTree is trying to locate the parsing programs?
> 
> Much appreciated
> 
> Gertjan
> 
> 
> 
> 
> ====RUN ON SINGLE PDF FILE =======
> 
> Indexing Data Source: "External-Program"
> Indexing "/room/swish_index/DirTree.pl"
> External Program found: /room/swish_index/DirTree.pl
> Indexing /home/ghofman/tmp10/swish_text.pdf
> Error: May not be a PDF file (continuing anyway)
> Error (0): PDF file is damaged - attempting to reconstruct xref table...
> Error: Couldn't find trailer dictionary
> Error: Couldn't read xref table
> Removing very common words...
> no words removed.
> Writing main index...
> err: No unique words indexed!
> .
> 
> === FULL RUN ON DIRECTORY ====
> 
> 
> Indexing Data Source: "External-Program"
> Indexing "/room/swish_index/DirTree.pl"
> External Program found: /room/swish_index/DirTree.pl
> Indexing now /home/ghofman/tmp10/swish_text.txt
> Indexing now /home/ghofman/tmp10/swish_text.pdf
> Indexing now /home/ghofman/tmp10/swish_test.xls
> Indexing now /home/ghofman/tmp10/swish_test.doc
> Indexing now /home/ghofman/tmp10/swish_test.rtf
> Indexing now /home/ghofman/tmp10/swish_test.ppt
> Error: May not be a PDF file (continuing anyway)
> Error (0): PDF file is damaged - attempting to reconstruct xref table...
> Error: Couldn't find trailer dictionary
> Error: Couldn't read xref table
> ./swtmpfltr0aS7OK is not OLE file or Error
> ./swtmpfltrHPmrp9 is not a Word Document.
> Removing very common words...
> no words removed.
> Writing main index...
> Sorting words ...
> Sorting 17 words alphabetically
> Writing header ...
> Writing index entries ...
>   Writing word text: Complete
>   Writing word hash: Complete
>   Writing word data: Complete
> 17 unique words indexed.
> 4 properties sorted.
> 5 files indexed.  5,010 total bytes.  22 total words.
> Elapsed time: 00:00:03 CPU time: 00:00:00
> 
> 
> 
> --- Peter Karman <peter@peknet.com> wrote:
> 
>> edit your copy of DirTree.pl like this:
>>
>>
>> sub check_path {
>>      my $path = shift;
>>      print STDERR "Indexing $path\n";
>>      return 1;  # return true to process this file
>> }
>>
>> that will print the name of the path it is about to
>> process.
>>
>>
>> Gertjan Hofman scribbled on 6/30/06 5:14 PM:
>>> Hi Peter,
>>>
>>> yes, you are right. Below is the output. I am finding
>>> the order of the output a little confusing - it would
>>> be good if SWISH-e would output the file name before
>>> it starts processing. Anyway, I am open to
>>> suggestions. As far as I can tell, it's just unhappy
>>> with the PDF. So to me it seems the PDF parsing is
>>> somehow different from the pipe example.
>>>
>>> Gertjan
>>>
>>>
>>> [ghofman@bi35-sensorinfo tmp]$ swish-e -v 5 -c swish_file.conf -S prog
>>> Parsing config file 'swish_file.conf'
>>> Indexing Data Source: "External-Program"
>>> Indexing "/room/swish_index/DirTree.pl"
>>> External Program found: /room/swish_index/DirTree.pl
>>> Error: May not be a PDF file (continuing anyway)
>>> Error (0): PDF file is damaged - attempting to reconstruct xref table...
>>> Error: Couldn't find trailer dictionary
>>> Error: Couldn't read xref table
>>> /home/ghofman/tmp10/swish_text.pdf - Using HTML2 parser -  (no words indexed)
>>>
>>> Removing very common words...
>>> no words removed.
>>> Writing main index...
>>> err: No unique words indexed!
>>>
>>> --- Peter Karman <peter@peknet.com> wrote:
>>>
>>>> I was suggesting that the -v3 option would tell you
>>>> if swish-e was in fact parsing swish_test.pdf or if
>>>> somehow it was being passed something different. I
>>>> just tried your example here and it worked for me, so
>>>> I was suggesting a way for you to start to debug
>>>> what's going on.
>>>>
>>>> Gertjan Hofman scribbled on 6/30/06 3:59 PM:
>>>>> Peter -
>>>>>
>>>>> Not sure I understand - I am passing only 1 file -
>>>>> swish_test.pdf (as indicated in the config file I
>>>>> enclosed). Of course I started with entire folders,
>>>>> but for the sake of demonstrating the problem I only
>>>>> parse the one file.
>>>>>
>>>>> I note there are older messages in the mailing list
>>>>> with similar-sounding problems - in that case
>>>>> spider.pl failed from a config file but worked in a
>>>>> pipe...
>>>>>
>>>>> Thanks
>>>>>
>>>>> Gertjan
>>>>>
>>>>>
>>>>> --- Peter Karman <peter@peknet.com> wrote:
>>>>>
>>>>>> Gertjan Hofman scribbled on 6/29/06 11:59 PM:
>>>>>>
>>>>>>> TRY 1: USING CONFIG FILE
>>>>>>>
>>>>>>> gertjan-laptop:~/tmp/swish_test> swish-e -S prog -c swish_file.conf
>>>>>>> Indexing Data Source: "External-Program"
>>>>>>> Indexing "./DirTree.pl"
>>>>>>> External Program found: ./DirTree.pl
>>>>>>> Error: May not be a PDF file (continuing anyway)
>>>>>>> Error (0): PDF file is damaged - attempting to reconstruct xref table...
>>>>>>> Error: Couldn't find trailer dictionary
>>>>>>> Error: Couldn't read xref table
>>>>>>> Removing very common words...
>>>>>>> no words removed.
>>>>>>> Writing main index...
>>>>>>> err: No unique words indexed!
>>>>>>>
>>>>>> add the -v3 option to get more verbose output. That
>>>>>> should tell you the name of the file being parsed
>>>>>> with SWISH::Filter (xpdf). I'm betting the file
>>>>>> isn't getting passed correctly.
>>>>>>
>>>>>> -- 
>>>>>> Peter Karman  .  http://peknet.com/  .  peter@peknet.com
>>>>>>
>>>>>
>>>> -- 
>>>> Peter Karman  .  http://peknet.com/  .  peter@peknet.com
>>>>
>>>
>>>
>> -- 
>> Peter Karman  .  http://peknet.com/  .  peter@peknet.com
>>
> 
> 
> 
> 

-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Thu Jul 6 06:13:18 2006