On Thu, 17 Oct 2002, Zeljan Silje wrote:
> I am using swish-e 2.2.1 on Red Hat Linux 6.1 (kernel 2.2.12-20), and here is
> the output of querying rpm:
> swish@nera$ rpm -qa| grep xml
I don't think those are libxml2.
> If I put the following line in my config file:
> IndexContents HTML2 .htm .html .shtml
> indexing goes fine on regular files (.html .htm .doc .pdf), but if swish tries
> to index an x.html file that is not an HTML file (in my case it is really a GIF
> picture, named by mistake), swish just sits forever on that file. The same
> thing happens if an HTML file is very badly formed (something after the </body> tag).
libxml2 seems reasonably robust, but if you throw binary files at it then
you are taking your chances.
But I don't see the same thing. Check again whether you are running a
reasonably current libxml2 (2.4.xx). And if you can really make libxml2
hang with badly formed HTML, then make an example available.
> ./swish-e -i ../html/images/ -v9
Indexing Data Source: "File-System"
Checking dir "../html/images"...
swish.gif - Using DEFAULT (HTML2) parser - (1 words)
swish2.gif - Using DEFAULT (HTML2) parser - (1 words)
swish2b.gif - Using DEFAULT (HTML2) parser - (1 words)
swishbanner1.gif - Using DEFAULT (HTML2) parser - (1 words)
dotrule1.gif - Using DEFAULT (HTML2) parser - (1 words)
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 5 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
5 unique words indexed.
4 properties sorted.
5 files indexed. 23235 total bytes. 5 total words.
Elapsed time: 00:00:00 CPU time: 00:00:00
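
If misnamed binaries are the trigger, one option is to look for them before
indexing. Here is a rough Python sketch. It is not part of swish-e, and the
function names and the signature list are illustrative assumptions only. The
NUL-byte check is a crude heuristic and would misfire on UTF-16 encoded HTML.

```python
import os

# Leading bytes of a few common binary formats (assumed list, not exhaustive).
BINARY_SIGNATURES = (b"GIF87a", b"GIF89a", b"\x89PNG\r\n\x1a\n", b"\xff\xd8\xff", b"%PDF-")
# Extensions taken from the IndexContents line quoted above.
HTML_EXTENSIONS = (".htm", ".html", ".shtml")


def looks_binary(path, probe=512):
    """Return True if the first `probe` bytes start with a known binary
    signature or contain a NUL byte."""
    with open(path, "rb") as fh:
        head = fh.read(probe)
    return head.startswith(BINARY_SIGNATURES) or b"\x00" in head


def misnamed_html(root):
    """Yield paths under `root` that have an HTML extension but look binary."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(HTML_EXTENSIONS):
                path = os.path.join(dirpath, name)
                if looks_binary(path):
                    yield path
```

Calling misnamed_html("../html") and printing each result would list the
files to rename or exclude before running the indexer.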
Bill Moseley email@example.com
Received on Thu Oct 17 13:06:55 2002