Re: DirTree works in pipe but not config file on PDF

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Thu Jul 06 2006 - 19:04:32 GMT
ah. The ol' "my config file contained more than I sent to the list" 
trick. ;)

yes, there are two ways to filter. This is due to the origins of 
Swish-e. Over the years it has grown more Perl-centric. So the 
FileFilter option dates back to the pre-Perl days, when you wanted to 
call an external program directly from the swish-e binary. The 
SWISH::Filter framework (Perl) works kind of the same way, but allows for 
more flexibility with respect to multiple filters, user callbacks, etc.

Your problem is likely that the file was getting filtered twice: once 
via SWISH::Filter (via DirTree.pl), then again by FileFilter (via 
swish-e). The error msg was from pdftotext itself, not swish-e.

fwiw, the likely future direction is to get rid of the FileFilter option 
altogether and let SWISH::Filter be the only way to filter.
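
A sketch of what that means for a -S prog setup, assuming the paths quoted 
later in this thread (the SwishProgParameters value in particular is an 
assumption; use whatever directory you actually index). Since DirTree.pl 
already runs SWISH::Filter, the FileFilter line is left disabled so the 
PDF is only converted once:

```
# swish_file.conf (sketch only; paths are assumptions taken from this thread)
IndexDir /room/swish_index/DirTree.pl
SwishProgParameters /home/ghofman/tmp10

# Disabled on purpose: DirTree.pl already filters PDFs via SWISH::Filter,
# so this line would likely hand pdftotext text that is no longer a PDF.
# FileFilter .pdf       /usr/bin/pdftotext   "'%p' -"
```

When indexing without -S prog, the FileFilter line is the one that was 
reported to work, so it would be kept there instead.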

pek

Gertjan Hofman scribbled on 7/6/06 1:29 PM:
> Problem - kind of solved.  It turns out, the
> FileFilter directive in the conf file mucks up the
> DirTree.pl program.  
> i.e.
> FileFilter .pdf       /usr/bin/pdftotext   "'%p' -"
> 
> which works fine when *not* using -S prog seems to
> interfere when using -S prog and DirTree.pl.
> 
> Clearly I am not understanding something. The
> documentation would suggest that there are TWO
> independent methods - FileFilter, or SWISH::Filter,
> the latter being invoked by DirTree.pl.  So why does the
> FileFilter directive matter when using DirTree.pl and
> why does it muck up the PDF parsing?  Odd.
> 
> Thanks for your help Peter
> 
> Gertjan
> 
> 
> 
> --- Peter Karman <peter@peknet.com> wrote:
> 
>> here's my test. see if you can mimic it exactly:
>>
>> [karpet@cartermac:~/tmp/s]$ swish-e -c conf -S prog -v3 -W0
>> Parsing config file 'conf'
>> Indexing Data Source: "External-Program"
>> Indexing "/usr/local/lib/swish-e/DirTree.pl"
>> External Program found: /usr/local/lib/swish-e/DirTree.pl
>> Indexing ./test.pdf
>> ./test.pdf - Using HTML2 parser -  (38 words)
>>
>> Removing very common words...
>> no words removed.
>> Writing main index...
>> Sorting words ...
>> Sorting 26 words alphabetically
>> Writing header ...
>> Writing index entries ...
>>    Writing word text: Complete
>>    Writing word hash: Complete
>>    Writing word data: Complete
>> 26 unique words indexed.
>> 4 properties sorted.
>> 1 file indexed.  583 total bytes.  38 total words.
>> Elapsed time: 00:00:03 CPU time: 00:00:00
>> Indexing done!
>> [karpet@cartermac:~/tmp/s]$ cat conf
>> #
>> IndexDir /usr/local/lib/swish-e/DirTree.pl
>>
>> SwishProgParameters test.pdf
>>
>> # end of the config file
>>
>>
>>
>> Since you say that it works fine if you run
>> DirTree.pl directly on the 
>> files, I don't suspect a bad .pdf file etc. I'm not
>> sure what's going on 
>> with your setup -- maybe try the full path to the
>> DirTree.pl command?
>>
>>
>>
>>
>>
>> Gertjan Hofman scribbled on 7/5/06 6:05 PM:
>>> Peter,
>>>
>>> Took me a day to get back to this. The problem persists
>>> - see below. The path/file is correct and yet it
>>> claims it's not PDF. 
>>>
>>> I wonder if I am just getting an incorrect error and I
>>> am being misled. I have 5 test files in 
>>> /home/ghofman/tmp10: a .doc, .txt, .ppt, .pdf and
>>> .rtf. When I run DirTree directly and pipe in swish-e
>>> it parses all files correctly. When I use the config
>>> file, only the .txt and .rtf result in words going to
>>> the index file. See the second run below. It's unable
>>> to parse the ppt, doc and pdf. Am I just having a path
>>> problem or something like that? How do I know where
>>> the DirTree is trying to locate the parsing programs?
>>> Much appreciated
>>>
>>> Gertjan
>>>
>>>
>>>
>>>
>>> ====RUN ON SINGLE PDF FILE =======
>>>
>>> Indexing Data Source: "External-Program"
>>> Indexing "/room/swish_index/DirTree.pl"
>>> External Program found: /room/swish_index/DirTree.pl
>>> Indexing /home/ghofman/tmp10/swish_text.pdf
>>> Error: May not be a PDF file (continuing anyway)
>>> Error (0): PDF file is damaged - attempting to
>>> reconstruct xref table...
>>> Error: Couldn't find trailer dictionary
>>> Error: Couldn't read xref table
>>> Removing very common words...
>>> no words removed.
>>> Writing main index...
>>> err: No unique words indexed!
>>> .
>>>
>>> === FULL RUN ON DIRECTORY ====
>>>
>>>
>>> Indexing Data Source: "External-Program"
>>> Indexing "/room/swish_index/DirTree.pl"
>>> External Program found: /room/swish_index/DirTree.pl
>>> Indexing now /home/ghofman/tmp10/swish_text.txt
>>> Indexing now /home/ghofman/tmp10/swish_text.pdf
>>> Indexing now /home/ghofman/tmp10/swish_test.xls
>>> Indexing now /home/ghofman/tmp10/swish_test.doc
>>> Indexing now /home/ghofman/tmp10/swish_test.rtf
>>> Indexing now /home/ghofman/tmp10/swish_test.ppt
>>> Error: May not be a PDF file (continuing anyway)
>>> Error (0): PDF file is damaged - attempting to
>>> reconstruct xref table...
>>> Error: Couldn't find trailer dictionary
>>> Error: Couldn't read xref table
>>> ./swtmpfltr0aS7OK is not OLE file or Error
>>> ./swtmpfltrHPmrp9 is not a Word Document.
>>> Removing very common words...
>>> no words removed.
>>> Writing main index...
>>> Sorting words ...
>>> Sorting 17 words alphabetically
>>> Writing header ...
>>> Writing index entries ...
>>>   Writing word text: Complete
>>>   Writing word hash: Complete
>>>   Writing word data: Complete
>>> 17 unique words indexed.
>>> 4 properties sorted.
>>> 5 files indexed.  5,010 total bytes.  22 total words.
>>> Elapsed time: 00:00:03 CPU time: 00:00:00
>>>
>>>
>>>
>>> --- Peter Karman <peter@peknet.com> wrote:
>>>
>>>> edit your copy of DirTree.pl like this:
>>>>
>>>>
>>>> sub check_path {
>>>>      my $path = shift;
>>>>      print STDERR "Indexing $path\n";
>>>>      return 1;  # return true to process this file
>>>> }
>>>>
>>>> that will print the name of the path it is about
>> to
>>>> process.
>>>>
>>>>
>>>> Gertjan Hofman scribbled on 6/30/06 5:14 PM:
>>>>> Hi Peter,
>>>>>
>>>>> yes, you are right. Below is the output.  I am finding
>>>>> the order of the output a little confusing - it would
>>>>> be good if SWISH-e would output the file name before
>>>>> it starts processing. Anyway, I am open to
>>>>> suggestions. As far as I can tell, it's just unhappy
>>>>> with the PDF. So to me it seems the PDF parsing is
>>>>> somehow different from the pipe example.
>>>>>
>>>>> Gertjan
>>>>>
>>>>>
>>>>> [ghofman@bi35-sensorinfo tmp]$ swish-e -v 5 -c swish_file.conf -S prog
>>>>> Parsing config file 'swish_file.conf'
>>>>> Indexing Data Source: "External-Program"
>>>>> Indexing "/room/swish_index/DirTree.pl"
>>>>> External Program found: /room/swish_index/DirTree.pl
>>>>> Error: May not be a PDF file (continuing anyway)
>>>>> Error (0): PDF file is damaged - attempting to reconstruct xref table...
>>>>> Error: Couldn't find trailer dictionary
>>>>> Error: Couldn't read xref table
>>>>> /home/ghofman/tmp10/swish_text.pdf - Using HTML2 parser -  (no words indexed)
>>>>>
>>>>> Removing very common words...
>>>>> no words removed.
>>>>> Writing main index...
>>>>> err: No unique words indexed!
>>>>>
>>>>> --- Peter Karman <peter@peknet.com> wrote:
>>>>>
>>>>>> I was suggesting that the -v3 option would tell
>>>> you
>>>>>> if swish-e was in 
> === message truncated ===
> 
> 

-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Thu Jul 6 12:04:33 2006