This is a multi-part message in MIME format.
------=_NextPart_000_00DC_01C2C789.D55994E0
Content-Type: text/plain;
charset="us-ascii"
Content-Transfer-Encoding: 7bit
Bill,
Thank you for the quick response.
My results differ from yours. Here is what I did.
1) Set up a one-line configuration file identical to yours.
2) From the command line in the directory with the configuration file:
swish-e -c c -i /usr/share/cups/doc/translation.pdf > index.log 2>index_err.log
3) Copied several files to my Windows box and converted them to Windows text
files.
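A quick way to compare two runs is to pull the summary line out of each index.log. The following is only a sketch: the function name parse_summary is made up for illustration, the pattern is modelled on the single summary line shown in the logs in this thread, and the plural form "files indexed" is an assumption.

```python
import re

# Pattern modelled on the summary line in the logs in this thread, e.g.
# "1 file indexed. 50985 total bytes. 3 total words."
# Accepting "files" as well as "file" is an assumption.
SUMMARY = re.compile(
    r"(\d+) files? indexed\.\s+(\d+) total bytes\.\s+(\d+) total words\."
)


def parse_summary(log_text):
    """Return (files, total_bytes, total_words) from an indexing log, or None."""
    match = SUMMARY.search(log_text)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


if __name__ == "__main__":
    mine = parse_summary("1 file indexed. 50985 total bytes. 3 total words.")
    bills = parse_summary("1 file indexed. 50985 total bytes. 3066 total words.")
    print("mine:", mine)
    print("Bill's:", bills)
```

Both runs read the same number of bytes, so the difference lies in how many words came out of the filter, not in which file was read.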
From the attached file dir_log.txt, you can see that an index was created.
From the attached file index_log.txt, you can see that only 3 words were indexed.
From the attached file index_err_log.txt, you can see that there seems to be
a problem with the tr/// at line 101 of _pdf2html.pl.
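To make the error messages easier to read: 0xae and 0xb7 both have the bit pattern 10xxxxxx, which UTF-8 reserves for continuation bytes, so neither can start a character on its own. A minimal Python sketch (not part of _pdf2html.pl; the function name describe_byte is made up for illustration) shows this, and also what the bytes would be if the text extracted from the PDF were Latin-1, which is only an assumption here and not something the logs confirm:

```python
def describe_byte(value):
    """Return (is_continuation, valid_alone_as_utf8, latin1_char) for one byte."""
    raw = bytes([value])
    # UTF-8 continuation bytes have their top two bits set to 10.
    is_continuation = (value & 0xC0) == 0x80
    try:
        raw.decode("utf-8")
        valid_alone = True
    except UnicodeDecodeError:
        valid_alone = False
    # Latin-1 maps every byte to a character, so this decode never fails.
    return is_continuation, valid_alone, raw.decode("latin-1")


if __name__ == "__main__":
    for value in (0xAE, 0xB7):  # the two bytes named in index_err.log
        print(hex(value), describe_byte(value))
```

If the Latin-1 assumption holds, 0xae would be a registered-trademark sign and 0xb7 a middle dot, which would fit the idea that the filter's input is being treated as UTF-8 when it is not.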
Do you have any more thoughts on this?
Thanks!
David Cogley
-----Original Message-----
From: Bill Moseley [mailto:moseley@hank.org]
Sent: Tuesday, January 28, 2003 5:52 PM
To: David Cogley
Cc: Multiple recipients of list
Subject: Re: [SWISH-E] Indexing pdf files
On Tue, 28 Jan 2003, David Cogley wrote:
> I'm having difficulty with indexing pdf files. I create a large index, but
> it seems to be garbage. "strings gimppdr" gives me no terms I expected.
Well, seems like you are setting it up correctly.
Let me try:
$ cat c
FileFilter .pdf ./_pdf2html.pl
$ ../src/swish-e -c c -i /usr/share/cups/doc-root/translation.pdf
Indexing Data Source: "File-System"
Indexing "/usr/share/cups/doc-root/translation.pdf"
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 638 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
638 unique words indexed.
4 properties sorted.
1 file indexed. 50985 total bytes. 3066 total words.
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!
$ ../src/swish-e -w cups
# SWISH format: 2.3.4
# Search words: cups
# Removed stopwords:
# Number of hits: 1
# Search time: 0.002 seconds
# Run time: 0.031 seconds
1000 /usr/share/cups/doc-root/translation.pdf "CUPS Translation Guide" 50985
.
$ ../src/swish-e -T index_words_only | wc -l
639
$ ../src/swish-e -T index_words_only | tail
will
windows
with
within
world
would
x
you
your
Can you repeat that with your pdf file?
--
Bill Moseley moseley@hank.org
------=_NextPart_000_00DC_01C2C789.D55994E0
Content-Type: text/plain;
name="dir_log.txt"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
filename="dir_log.txt"
total 424
-rw-rw-r-- 1 david david 103 Jan 29 11:02 c
-rw-rw-r-- 1 david david 0 Jan 29 11:04 dir.log
-rw-rw-r-- 1 david david 10498 Jan 29 11:03 index_err.log
-rw-rw-r-- 1 david david 1108 Jan 29 11:03 index.log
-rw-r--r-- 1 david david 402249 Jan 29 11:03 index.swish-e
-rw-r--r-- 1 david david 73 Jan 29 11:03 index.swish-e.prop
------=_NextPart_000_00DC_01C2C789.D55994E0
Content-Type: text/plain;
name="index_log.txt"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
filename="index_log.txt"
Indexing Data Source: "File-System"
Indexing "/usr/share/cups/doc/translation.pdf"
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 3 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: ...
Writing word text: Complete
Writing word hash: ...
Writing word hash: 10%
Writing word hash: 20%
Writing word hash: 30%
Writing word hash: 40%
Writing word hash: 50%
Writing word hash: 60%
Writing word hash: 70%
Writing word hash: 80%
Writing word hash: 90%
Writing word hash: 100%
Writing word hash: Complete
Writing word data: ...
Writing word data: Complete
3 unique words indexed.
Sorting property: swishdocpath
Sorting property: swishtitle
Sorting property: swishdocsize
Sorting property: swishlastmodified
4 properties sorted.
1 file indexed. 50985 total bytes. 3 total words.
Elapsed time: 00:00:01 CPU time: 00:00:00
Indexing done!
------=_NextPart_000_00DC_01C2C789.D55994E0
Content-Type: text/plain;
name="index_err_log.txt"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment;
filename="index_err_log.txt"
Malformed UTF-8 character (unexpected continuation byte 0xae, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 77.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 99.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 101.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 103.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 105.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 107.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 109.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 135.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 137.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 139.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 141.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 143.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 145.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 147.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 149.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 151.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 153.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 161.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 163.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 165.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 167.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 169.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 171.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 173.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 175.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 177.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 179.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 181.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 183.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 217.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 219.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 221.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 223.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 225.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 227.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 229.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 231.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 233.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 235.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 237.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 239.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 241.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 243.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 245.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 247.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 249.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 251.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 253.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 255.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 257.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 259.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 261.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 263.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 265.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 267.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 269.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 576.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 578.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 580.
Malformed UTF-8 character (unexpected continuation byte 0xb7, with no =
preceding start byte) in transliteration (tr///) at =
/home/david/bin/_pdf2html.pl line 101, <F> line 582.
------=_NextPart_000_00DC_01C2C789.D55994E0--
Received on Wed Jan 29 23:43:02 2003