Re: [swish-e] *some* pdf documents not indexed

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Sun Jul 29 2007 - 05:19:27 GMT

On Sat, Jul 28, 2007 at 01:43:07PM +1000, Dr Michael Daly wrote:
> Dear list
> If anyone can solve this mystery, it would be great! Swish-e 2.4.5 (on
> centos) is failing to index some pdf documents. Here is the index file:

>     IndexDir /home/server_dir/Resources/Research/2007
>     ReplaceRules remove /home/server_dir/
>     IndexOnly .htm .html .txt .doc .pdf
>     IndexContents TXT* .txt
>     DefaultContents HTML*
>     ParserWarnLevel 9
>     IndexFile /home/indices/for_index4.index
> 
> 
> 4. this seems to work:
> swish-e -i
> /home/server_dir/Resources/Research/2007/Low_Purine_Diet_405.pdf -T
> indexed_words | less
> eg
> Warning: Substituted 86 embedded null character(s) in file
> '/home/server_dir/Resources/Research/2007/Low_Purine_Diet_405.pdf'
>  with a newline

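Your config doesn't tell swish-e how to convert the PDF files to text,
so it is reading the raw PDF data. That is why you see the warning
about embedded null characters.

You need to specify a filter in the config, or use a script like
spider.pl or DirTree.pl that knows to run pdftotext on the PDF to
convert it to text.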
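
As a rough sketch of the filter approach, something like the following
could be added to the config above. It assumes pdftotext (from xpdf or
poppler) is installed and on the PATH, and the exact quoting of the
options string is from memory of the SWISH-CONFIG docs, so check it
against the documentation for your version:

    # Assumption: pdftotext is on the PATH; use a full path otherwise.
    # %p is replaced with the path of the file being indexed, and "-"
    # makes pdftotext write the extracted text to stdout.
    FileFilter .pdf pdftotext "'%p' -"

    # The filter output is plain text, so parse it with the TXT parser
    # instead of the DefaultContents HTML* parser.
    IndexContents TXT* .pdf

After reindexing, running the same -T indexed_words check on one of the
PDFs should show real words rather than the null character warning.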

Received on Sun Jul 29 01:19:25 2007