David L Norris wrote:
> On Thu, 2005-10-13 at 13:53 -0700, J. David Boyd wrote:
>
>>Running
>>'swish-e -c xxx -T index_words_only'
>>or
>>'swish-e -c xxx -T parse_words'
>>or anything shows that the words it is finding are garbage. It looks as
>>if the pdftotext code is not running.
>
>
> That's certainly a possibility. Hard to say without an example of how
> you're running the index process, though.
>
Okay, here's one example:
------------------------------------
swish.cfg:
IndexDir .
IndexContents TXT2 .pdf
FileFilter .pdf pdftotext "'%p' -"
------------------------------------
and this outputs the following (running 'swish-e -c swish.cfg' from the
directory where the PDF file is):
------------------------------------
Indexing Data Source: "File-System"
Indexing "."
Error: Couldn't open file ''.\test.pdf''
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 44 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: ...
Writing word text: Complete
Writing word hash: ...
Writing word hash: 10%
Writing word hash: 20%
Writing word hash: 30%
Writing word hash: 40%
Writing word hash: 50%
Writing word hash: 60%
Writing word hash: 70%
Writing word hash: 80%
Writing word hash: 90%
Writing word hash: 100%
Writing word hash: Complete
Writing word data: ...
Writing word data: Complete
44 unique words indexed.
Sorting property: swishdocpath
Sorting property: swishtitle
Sorting property: swishdocsize
Sorting property: swishlastmodified
4 properties sorted.
7 files indexed. 416,453 total bytes. 69 total words.
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!
------------------------------------
And here is the output of 'swish-e -T index_words_only':
------------------------------------
02
09
10
14
2
2005
29
3
3index
4
a
agmz
cfg0co
co
data
daylight
e
eastern
f
file
filefilter
format
frx
index
indexcontents
indexdir
indexing
p
pdf
pdf6c
pdftotext
prop
s
source
swish
system
tempco
test
time
txt2
x
z
ë
ü
------------------------------------
None of these words are in the PDF file I am trying to index.
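
The doubled quotes in the error line (Couldn't open file ''.\test.pdf'') would be consistent with the filter receiving a file name that still carries the single quotes from the FileFilter option string. That is only a guess and has not been confirmed here. A minimal Python sketch of the idea follows. The helper expand_filter_options is hypothetical and is not part of swish-e, and shlex's non-POSIX mode is used only as a stand-in for a shell that gives single quotes no special meaning.

```python
import shlex


def expand_filter_options(template: str, path: str) -> str:
    """Hypothetical helper: substitute the %p placeholder with the file path.

    This mimics the option string from swish.cfg as plain text replacement;
    it is an assumption about the behaviour, not swish-e's actual code.
    """
    return template.replace("%p", path)


# Option string taken from the FileFilter line in swish.cfg above.
options = expand_filter_options("'%p' -", r".\test.pdf")

# POSIX-style splitting consumes the single quotes around the path.
posix_args = shlex.split(options, posix=True)

# Non-POSIX splitting keeps the quote characters inside the token,
# so the "file name" would literally start and end with a quote.
literal_args = shlex.split(options, posix=False)

print(posix_args)
print(literal_args)
```

If the quotes do reach pdftotext as part of the name, it would be looking for a file literally called '.\test.pdf' (quotes included), which would explain why it cannot be opened.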
Dave
Received on Fri Oct 14 06:24:15 2005