Hi
The HTML parser is choking when indexing image PDFs; please refer to the
output below.
I can suppress the error messages by lowering the ParserWarnLevel setting
to 1, but I am concerned that swish-e's attempt to index image PDFs is
slowing it down. Setting the TXT parser rather than the HTML parser as the
default also suppresses the error output, but results in only .txt
documents being indexed. The libxml2 version used when compiling swish-e
is an old one. Perhaps this has something to do with it?
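For reference, a minimal sketch of the two workarounds described above,
assuming they are set in the swish-e configuration file through the
ParserWarnLevel and DefaultContents directives. The values shown are
illustrative and are not copied from the actual config in use here.

```
# Hypothetical swish-e config fragment; values are illustrative only.

# Workaround 1: keep the HTML parser as the default, but lower the
# libxml2 warning level so that parse errors are no longer printed.
DefaultContents HTML*
ParserWarnLevel 1

# Workaround 2 (alternative to the above): make the TXT parser the
# default instead of the HTML parser.
# DefaultContents TXT*
```

Neither setting prevents swish-e from reading the PDF files; both only
change how the content is parsed or how much is reported.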
As a separate question, is there a way to index at least the *filename* of
image PDF documents?
Swish-e 2.4.7 is running on a QNAP device, with Perl 5.14.2 freshly compiled
from source. Output of uname -a:
Linux NASC2089B 2.6.33.2 #1 Tue Aug 16 00:31:52 CST 2011 armv5tel GNU/Linux
Here is the output:
for swish-e forum
ÃÂÃWi³æ"# BºÂ
]BÂÂÂÂÃGâ<
^
/share/MD0_DATA/server_dir/Correspondence/2011_Correspondence/National_1585_0001.pdf:797:
error: htmlParseStartTag: invalid element name
;4EÂùÂ
GKçÂÂ%?Â:Ã
GHG>Âr^Q¾>4æÃ²§Iô"Â"9Â
DÂgÂÂÂ
^
/shÃÃM#ÃRY¦tD5GÃerver_dir/CorrespXÂÃqGÂÂ"uÂ!ÂHU.4,ÃìDDU&WHe/National_1585_0001.pdf:797:
error: htmlParseEntityRef: expecting ';'
^
/share/MD0_DATA/server_dir/Correspondence/2011_Correspondence/National_1585_0001.pdf:797:
error: error parsing attribute name
HU.4,ÃìDDU&WHª#¤ô)2¡tÂiÃÂ+Hi¡w
øsÂtÃ"ùÃÂÂÂ3":<G
^
/share/MD0_DATA/server_dir/Correspondence/2011_Correspondence/National_1585_0001.pdf:797:
error: Tag g invalid
&WHª#¤ô)2¡tÂiÃÂ+Hi¡w
øsÂtÃ"ùÃÂÂÂ3":<G¢ïL®áÂg
^
/share/MD0_DATA/server_dir/Correspondence/2011_Correspondence/National_1585_0001.pdf:797:
error: Couldn't find end of Start Tag g
&WHª#¤ô)2¡tÂiÃÂ+Hi¡w
øsÂtÃ"ùÃÂÂÂ3":<G¢ïL®áÂg
^
/share/MD0_DATA/server_dir/Correspondence/2011_Correspondence/National_1585_0001.pdf:797:
error: htmlParseEntityRef: no name
¡w
øsÂtÃ"ùÃÂÂÂ3":<G¢ïL®áÂV¤GÃ
7ÂFk'^?Â'ÃfÂgÂõÂ&
Thanks
Michael
_______________________________________________
Users mailing list
Users(at)not-real.lists.swish-e.org
http://lists.swish-e.org/listinfo/users
Received on Tue Mar 13 2012 - 09:27:34 GMT