At 01:33 PM 02/06/02 -0800, AHatton@oxfam.org.uk wrote:
>Whilst using swish-e -S http.. etc works fine for indexing other content
>we can't get it to index PDF files.
>We are using version swish-e 2.0
I'd strongly recommend upgrading to the dev version.
I'd also strongly recommend using -S prog with spider.pl when using the
-dev version, but that's a minor issue here.
>#!/bin/sh
>#/usr/X11R6/bin/pdftotext -q $1 -
>#/usr/X11R6/bin/pdftotext "$1" - 2>/dev/null
>/usr/X11R6/bin/pdftotext "$1" -
You shouldn't need a shell script. You should be able to call pdftotext
directly from the FileFilter command.
See:
http://swish-e.org/2.2/docs/SWISH-CONFIG.html#Document_Filter_Directives
FileFilter .pdf pdftotext "'%p' -"
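
To see what that directive amounts to, here is a minimal sketch of the substitution, assuming (as the linked docs describe) that %p is replaced by the full path of the file being filtered. The expand_filter helper and the /tmp/epdtest.pdf path are hypothetical, made up for illustration; they are not part of swish-e.

```shell
#!/bin/sh
# Hypothetical helper (not part of swish-e): prints the command line that a
# FileFilter entry would roughly correspond to once %p is filled in.
# usage: expand_filter PROGRAM OPTIONS PATH
expand_filter() {
    prog=$1
    opts=$2
    path=$3
    # Replace every %p in the option template with the path.
    # Illustration only: a path containing '|' or '&' would confuse sed here.
    printf '%s %s\n' "$prog" "$(printf '%s' "$opts" | sed "s|%p|$path|g")"
}

# Assumed example path; the real one is whatever temporary file swish-e
# hands to the filter.
expand_filter pdftotext "'%p' -" /tmp/epdtest.pdf
```

Running the printed command by hand against a real PDF is a quick way to confirm that pdftotext itself writes text to stdout before involving swish-e at all.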
Here it is working with the current -dev version:
> cat c
FileFilter .pdf pdftotext "'%p' -"
Delay 0
> ./swish-e -c c -S http -i http://www.sanface.com/epdtest.pdf -T indexed_words
Indexing Data Source: "HTTP-Crawler"
Indexing "http://www.sanface.com/epdtest.pdf"
Adding:[1:swishdefault(1)] 'test' Pos:1 Stuct:0x1 ( FILE )
Adding:[1:swishdefault(1)] 'with' Pos:2 Stuct:0x1 ( FILE )
Adding:[1:swishdefault(1)] 'the' Pos:3 Stuct:0x1 ( FILE )
Adding:[1:swishdefault(1)] 'tiger' Pos:4 Stuct:0x1 ( FILE )
Adding:[1:swishdefault(1)] 'epd' Pos:5 Stuct:0x1 ( FILE )
Adding:[1:swishdefault(1)] 'converted' Pos:6 Stuct:0x1 ( FILE )
Adding:[1:swishdefault(1)] 'from' Pos:7 Stuct:0x1 ( FILE )
Adding:[1:swishdefault(1)] 'the' Pos:8 Stuct:0x1 ( FILE )
Adding:[1:swishdefault(1)] 'standard' Pos:9 Stuct:0x1 ( FILE )
Adding:[1:swishdefault(1)] 'postscript' Pos:10 Stuct:0x1 ( FILE )
Adding:[1:swishdefault(1)] 'tiger' Pos:11 Stuct:0x1 ( FILE )
Adding:[1:swishdefault(1)] 'by' Pos:12 Stuct:0x1 ( FILE )
Adding:[1:swishdefault(1)] 'pstoepd' Pos:13 Stuct:0x1 ( FILE )
Adding:[1:swishdefault(1)] 'converter' Pos:14 Stuct:0x1 ( FILE )
Adding:[1:swishdefault(1)] '1' Pos:15 Stuct:0x1 ( FILE )
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 13 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
13 unique words indexed.
4 properties sorted.
1 file indexed. 109922 total bytes. 15 total words.
Elapsed time: 00:00:03 CPU time: 00:00:00
Indexing done!
--
Bill Moseley
mailto:moseley@hank.org
Received on Wed Feb 6 21:56:59 2002