David L Norris wrote:
> On Thu, 2005-10-13 at 13:53 -0700, J. David Boyd wrote:
>
>>Running
>>'swish-e -c xxx -T index_words_only'
>>or
>>'swish-e -c xxx -T parse_words'
>>or anything shows that the words it is finding are garbage. It looks as
>>if the pdftotext code is not running.
>
>
> That's certainly a possibility. Hard to say without an example of how
> you're running the index process, though.
>
Here's another one, based on something I got off the archives:
------------------------------------
swish.cfg:
IndexDir c:/xfer/swish
FileFilter .pdf ./lib/swish-e/swish_filter.pl '"%p" "%P"'
------------------------------------
Running 'swish-e -c swish.cfg' from the directory where I installed
SWISH-E produces this output:
------------------------------------
Indexing Data Source: "File-System"
Indexing "c:/xfer/swish"
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 135 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: ...
Writing word text: Complete
Writing word hash: ...
Writing word hash: 10%
Writing word hash: 20%
Writing word hash: 30%
Writing word hash: 40%
Writing word hash: 50%
Writing word hash: 60%
Writing word hash: 70%
Writing word hash: 80%
Writing word hash: 90%
Writing word hash: 100%
Writing word hash: Complete
Writing word data: ...
Writing word data: Complete
135 unique words indexed.
Sorting property: swishdocpath
Sorting property: swishtitle
Sorting property: swishdocsize
Sorting property: swishlastmodified
4 properties sorted.
5 files indexed. 416,599 total bytes. 282 total words.
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!
------------------------------------
However, the only PDF file in the directory being indexed contains 4
words, each on its own line: abercrombie, fitch, sears, roebuck.
I must be doing something wrong.
I can certainly index html files with no problem whatsoever, so I know
the basic program functionality is there.
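One way to narrow this down would be to check, outside of SWISH-E, whether pdftotext itself extracts those four words from the PDF. Below is a rough sketch in Python, assuming pdftotext is on the PATH. The function names and the script layout are made up for illustration and are not part of SWISH-E or its filter scripts.

```python
import subprocess
import sys

# The four words the test PDF is known to contain.
EXPECTED = ["abercrombie", "fitch", "sears", "roebuck"]


def missing_words(text, expected=EXPECTED):
    """Return the expected words that do not appear in text (case-insensitive)."""
    found = set(text.lower().split())
    return [word for word in expected if word not in found]


def extract_text(pdf_path):
    """Run pdftotext on pdf_path and return the text it writes to stdout.

    Assumes the pdftotext executable is installed and on the PATH;
    passing "-" as the output file sends the text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical usage: python check_pdf.py path/to/file.pdf
    if len(sys.argv) != 2:
        sys.exit("usage: check_pdf.py FILE.pdf")
    missing = missing_words(extract_text(sys.argv[1]))
    if missing:
        print("pdftotext output is missing:", ", ".join(missing))
    else:
        print("pdftotext extracted all expected words")
```

If this reports all four words, the converter is probably fine and the problem is more likely in how the FileFilter line calls the filter. If it fails or reports missing words, the problem is on the pdftotext side.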
Received on Fri Oct 14 06:24:46 2005