[swish-e] HTML Parser chokes when indexing image pdf's

From: Dr Michael Daly
Date: Tue, 13 Mar 2012 20:17:31 +1100 (EST)
Hi,

The HTML parser is choking when indexing image PDFs; please refer to the
output below.

I can suppress the error messages by lowering the ParserWarnLevel setting
to 1, but I am concerned that swish-e's attempt to index image PDFs is
slowing it down. Setting the TXT parser rather than the HTML parser as the
default also suppresses the error output, but then only .txt documents
are indexed. The libxml2 version used when compiling swish-e is an old
one; perhaps this has something to do with it?
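
For reference, the two workarounds described above would correspond to
configuration directives roughly like the following. This is only a sketch
of the relevant lines, not the actual config file in use: the directive
names are standard swish-e ones, but the exact values (in particular TXT*
as the default parser setting) are assumptions based on the description.

```
# Workaround 1: lower the libxml2 parser warning level so the
# parse errors are no longer printed (assumed value, per the text above)
ParserWarnLevel 1

# Workaround 2 (alternative): make the text parser the default
# instead of the HTML parser; assumed form of the setting
# DefaultContents TXT*
```

Neither line stops swish-e from reading the PDF files; the first only
hides the parser messages.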


As a separate question, is there a way to at least index the *filename* of
image PDF documents?

Swish-e 2.4.7 is running on a QNAP device, with Perl 5.14.2 freshly compiled
from source. Output of uname -a:
Linux NASC2089B 2.6.33.2 #1 Tue Aug 16 00:31:52 CST 2011 armv5tel GNU/Linux


Here is the output:

for swish-e forum
Wi³æ"# Bº
                                 ]BGâ<
                                                                               ^
/share/MD0_DATA/server_dir/Correspondence/2011_Correspondence/National_1585_0001.pdf:797:
error: htmlParseStartTag:

invalid element name
;4Eù
       GKç%?:
                  GHG>r^Q­¾>4沧Iô""9
                                               Dg
                                                            ^
/shM#RY¦tD5Gerver_dir/CorrespXqG"u!HU.4,ìDDU&WHe/National_1585_0001.pdf:797:
error: htmlParseEntityRef:

expecting ';'
                                                                               ^
/share/MD0_DATA/server_dir/Correspondence/2011_Correspondence/National_1585_0001.pdf:797:
error: error parsing

attribute name
HU.4,ìDDU&WHª#¤ô)2¡ti+Hi¡w
øst"ù3":<G
                                                                               ^
/share/MD0_DATA/server_dir/Correspondence/2011_Correspondence/National_1585_0001.pdf:797:
error: Tag g invalid
&WHª#¤ô)2¡ti+Hi¡w
øst"ù3":<G¢ïL®ág
                                                                               ^
/share/MD0_DATA/server_dir/Correspondence/2011_Correspondence/National_1585_0001.pdf:797:
error: Couldn't find end

of Start Tag g
&WHª#¤ô)2¡ti+Hi¡w
øst"ù3":<G¢ïL®ág
                                                                               ^
/share/MD0_DATA/server_dir/Correspondence/2011_Correspondence/National_1585_0001.pdf:797:
error: htmlParseEntityRef:

no name
¡w
øst"ù3":<G¢ïL®áV¤G
                                 7Fk'^?'fgõ&

Thanks
Michael
Received on Tue Mar 13 2012 - 09:27:34 GMT