[swish-e] Document parser errors

From: Joe Colgan <joe(at)not-real.joecolgan.com>
Date: Wed Nov 21 2007 - 14:38:04 GMT
Hi All,

Is there any way to output indexing errors to a file (during the indexing
process) with information that shows which filters/document parsers are
generating which error types?

I'm experiencing a number of errors during indexing, in particular a
massive volume of "Format a4 is redefined" messages that I can't seem to
trace to a particular parser. Google hasn't helped.

I've just installed Swish-e 2.4.5 on CentOS 4.2 (kernel 2.6.9-41.ELsmp) and
am planning to use it to index and search about 600,000 MS Office and PDF
files. I'm running the following index command:

swish-e -S prog -c /usr/local/lib/swish-e/swish.conf -e

The swish.conf file has the following directives (I've tried to include only
the relevant ones):
* ParserWarnLevel 1 
* IndexReport 1
* DefaultContents HTML
* FuzzyIndexingMode stemming_en1
* IndexOnly .html .htm .doc .ppt .pps .xls .pdf .txt .csv
* StoreDescription HTML <body> 5000
* StoreDescription XML <body> 5000
* StoreDescription TXT <body> 5000
* StoreDescription HTML2 <body> 5000
* StoreDescription XML2 <body> 5000
* StoreDescription TXT2 <body> 5000

I'm not sure, but I think Swish-e is using the following filters:
* catdoc
* libxml2
* ppt2txt
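
One way to check the filters individually is to run each one by hand on a single sample document and count how often the message appears on its stderr. The following is only a rough sketch under several assumptions: the filters are on the PATH, they write their warnings to stderr, and the function name, the sample path and the command lines in the usage note are hypothetical placeholders rather than anything taken from the Swish-e setup.

```python
import subprocess


def probe_filters(commands, needle):
    """Run each filter command once and count stderr lines containing needle.

    commands maps a label to an argv list (hypothetical examples below).
    A value of None means the program could not be started at all.
    """
    hits = {}
    for name, argv in commands.items():
        try:
            result = subprocess.run(
                argv,
                stdin=subprocess.DEVNULL,    # never wait for keyboard input
                stdout=subprocess.DEVNULL,   # the converted text is not needed here
                stderr=subprocess.PIPE,      # warnings are assumed to go to stderr
                text=True,
                errors="replace",            # tolerate odd bytes in the output
            )
        except FileNotFoundError:
            hits[name] = None                # filter not installed or not on PATH
            continue
        hits[name] = sum(needle in line for line in result.stderr.splitlines())
    return hits
```

For example, `probe_filters({"catdoc": ["catdoc", "/tmp/sample.doc"]}, "Format a4 is redefined")` would report how many times catdoc alone prints the message for that one (hypothetical) file. A filter that shows a non-zero count here is a likely source.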

Swish-e still successfully indexes more than 50% of the files in my test
runs of ~1200 documents; however, I'd like to improve this if possible.

Does anybody have any suggestions on how to deal with such indexing errors
(in particular "Format a4 is redefined"), or at least how to track them back
to a particular filter to begin researching a fix?
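
As a starting point for researching this, the indexer's output can be captured to a file with ordinary shell redirection and then summarised by message text. The sketch below is an assumption-laden illustration, not a Swish-e feature: it assumes the messages arrive one per line on stdout or stderr, and the log file name and the function name are hypothetical.

```python
from collections import Counter


def tally_messages(lines):
    """Return (message, count) pairs, most frequent first.

    Lines are compared after stripping surrounding whitespace;
    blank lines are ignored.
    """
    counts = Counter()
    for line in lines:
        text = line.strip()
        if text:
            counts[text] += 1
    return counts.most_common()


# Usage sketch (hypothetical log name), after capturing the run with
#   swish-e -S prog -c /usr/local/lib/swish-e/swish.conf -e > index.log 2>&1
#
# with open("index.log", errors="replace") as log:
#     for message, count in tally_messages(log)[:20]:
#         print(count, message)
```

This does not say which filter produced a message, but it does show which messages dominate the output and how many distinct ones there are.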

Thanks in advance for any assistance. I've endeavoured to search the list
archives to avoid asking an old question.

Ta.

Joe Colgan
joe@joecolgan.com
Melbourne, Australia
_______________________________________________
Users mailing list
Users@lists.swish-e.org
http://lists.swish-e.org/listinfo/users
Received on Wed Nov 21 09:38:22 2007