On Tue, 28 Jan 2003, David Cogley wrote:
> I'm having difficulty with indexing pdf files. I create a large index, but
> it seems to be garbage. "strings gimppdr" gives me no terms I expected.
Well, it seems like you are setting it up correctly.
Let me try:
$ cat c
FileFilter .pdf ./_pdf2html.pl
$ ../src/swish-e -c c -i /usr/share/cups/doc-root/translation.pdf
Indexing Data Source: "File-System"
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 638 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
638 unique words indexed.
4 properties sorted.
1 file indexed. 50985 total bytes. 3066 total words.
Elapsed time: 00:00:00 CPU time: 00:00:00
$ ../src/swish-e -w cups
# SWISH format: 2.3.4
# Search words: cups
# Removed stopwords:
# Number of hits: 1
# Search time: 0.002 seconds
# Run time: 0.031 seconds
1000 /usr/share/cups/doc-root/translation.pdf "CUPS Translation Guide"
$ ../src/swish-e -T index_words_only | wc -l
$ ../src/swish-e -T index_words_only | tail
Can you repeat those steps with your PDF file and post the output?
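
To check the word dump for the terms you expected, something like the sketch below might help. Note that missing_terms is a made-up helper, not part of swish-e, and it assumes the dump has one word per line and that case can be ignored; adjust it if your output looks different.

```shell
# Hypothetical helper (not part of swish-e): read a word list on stdin,
# assumed to be one word per line, and report which of the expected
# terms given as arguments do not appear in it.
missing_terms() {
    # Lowercase the whole word list once so the comparison ignores case.
    words=$(tr '[:upper:]' '[:lower:]')
    for term in "$@"; do
        lower=$(printf '%s' "$term" | tr '[:upper:]' '[:lower:]')
        # -x matches whole lines only, -F treats the term as a fixed string.
        if ! printf '%s\n' "$words" | grep -qxF -- "$lower"; then
            printf 'missing: %s\n' "$term"
        fi
    done
}

# Example use, assuming the same layout as in the transcript above:
#   ../src/swish-e -T index_words_only | missing_terms cups translation
```

If it prints nothing, every term you listed was found in the dump.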
Bill Moseley email@example.com
Received on Tue Jan 28 22:53:34 2003