Re: Indexing pdf files

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Jan 28 2003 - 22:53:19 GMT

On Tue, 28 Jan 2003, David Cogley wrote:

> I'm having difficulty with indexing pdf files.  I create a large index, but
> it seems to be garbage.  "strings gimppdr" gives me no terms I expected.

Well, it seems like you are setting it up correctly.

Let me try:

$ cat c
FileFilter .pdf ./_pdf2html.pl

$ ../src/swish-e -c c -i /usr/share/cups/doc-root/translation.pdf
Indexing Data Source: "File-System"
Indexing "/usr/share/cups/doc-root/translation.pdf"
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 638 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
638 unique words indexed.
4 properties sorted.
1 file indexed.  50985 total bytes.  3066 total words.
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!


$ ../src/swish-e -w cups
# SWISH format: 2.3.4
# Search words: cups
# Removed stopwords: 
# Number of hits: 1
# Search time: 0.002 seconds
# Run time: 0.031 seconds
1000 /usr/share/cups/doc-root/translation.pdf "CUPS Translation Guide"
50985
.


$ ../src/swish-e -T index_words_only | wc -l
    639

$ ../src/swish-e -T index_words_only | tail
will
windows
with
within
world
would
x
you
your

Can you repeat that with your pdf file?
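
If it saves typing, the same steps could be wrapped in a small shell function. This is only a rough sketch: it assumes swish-e is at ../src/swish-e and the config file is named c, as in my session above, and that the default index file is used. The name check_pdf is made up for illustration and is not part of swish-e.

```shell
# Sketch only: repeats the session above for an arbitrary PDF.
# SWISH and CONFIG mirror the paths used above; adjust them for your setup.
SWISH=${SWISH:-../src/swish-e}
CONFIG=${CONFIG:-c}

# check_pdf is a hypothetical helper name, not something swish-e provides.
check_pdf() {
    pdf=$1
    # Index the single PDF through the FileFilter set in the config file.
    "$SWISH" -c "$CONFIG" -i "$pdf" || return 1
    # Dump the indexed words and show the last few. These should be
    # readable terms from the document, not binary-looking junk.
    "$SWISH" -T index_words_only | tail
}
```

Then run "check_pdf /path/to/your.pdf" and see whether the words at the end look like terms from your document.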


-- 
Bill Moseley moseley@hank.org
Received on Tue Jan 28 22:53:34 2003