Re: Problem with IndexContents setting

From: Bill Moseley <moseley(at)>
Date: Thu Oct 17 2002 - 13:03:16 GMT
On Thu, 17 Oct 2002, Zeljan Silje wrote:

> Hi,
> I am using swish2.2.1 on RedHat linux 6.1 (kernel 2.2.12-20), and here is
> the output of querying rpm:
> swish@nera$ rpm -qa| grep xml
> libxml10-1.0.0-2
> libxml-1.8.6-2
> libxml-devel-1.8.6-2

I don't think those are libxml2.
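
A quick way to double-check that, sketched in Python (not part of swish-e; the helper name find_libxml2 is made up for illustration, and it assumes the usual name-version-release layout of rpm package names):

```python
def find_libxml2(packages):
    """Return (name, version) pairs for libxml2 packages.

    Assumes each entry looks like "name-version-release",
    which is how `rpm -qa` prints installed packages.
    """
    found = []
    for pkg in packages:
        parts = pkg.strip().rsplit("-", 2)
        if len(parts) != 3:
            continue  # not in name-version-release form, skip it
        name, version, _release = parts
        # libxml2 is packaged under its own name; "libxml" and
        # "libxml10" are the older 1.x library.
        if name in ("libxml2", "libxml2-devel"):
            found.append((name, version))
    return found


# The three packages listed in the quoted rpm output.
installed = [
    "libxml10-1.0.0-2",
    "libxml-1.8.6-2",
    "libxml-devel-1.8.6-2",
]
print(find_libxml2(installed))  # → []
```

An empty list here means none of the listed packages is libxml2.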

> If I put the following line in my config file:
> IndexContents HTML2 .htm .html .shtml
> indexing goes fine on regular files (.html .htm .doc .pdf), but if swish tries
> to index an x.html file which is not an HTML file (in my case it is really a GIF
> picture, named by mistake), swish just sits forever on that file. The same
> thing happens if the HTML file is very badly formed (something after the </body> tag).

libxml2 seems reasonably robust, but if you throw binary files at it then
you are taking your chances.  

But I don't see the same thing.  Check again whether you are running a
reasonably current libxml2 (2.4.xx).  And if you can really make libxml2
hang with badly formed HTML then make an example available.
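
For the misnamed-GIF case, one option is to find such files before indexing. Below is a minimal sketch in Python, not anything swish-e provides: the function names looks_binary and misnamed_html are hypothetical, and the check (GIF signature or a NUL byte in the first 1024 bytes) is only a rough heuristic.

```python
import os

# GIF files start with one of these two signatures.
GIF_MAGIC = (b"GIF87a", b"GIF89a")


def looks_binary(path, sample_size=1024):
    """Heuristic: GIF signature or a NUL byte near the start of the file."""
    with open(path, "rb") as fh:
        head = fh.read(sample_size)
    return head.startswith(GIF_MAGIC) or b"\x00" in head


def misnamed_html(root, exts=(".htm", ".html", ".shtml")):
    """List files under root that have an HTML extension but look binary."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.lower().endswith(exts):
                full = os.path.join(dirpath, name)
                if looks_binary(full):
                    hits.append(full)
    return sorted(hits)
```

Calling misnamed_html("../html") would list candidates to rename or exclude before running the indexer; the extensions match the ones on the IndexContents line quoted above.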

> ./swish-e -i ../html/images/  -v9
Indexing Data Source: "File-System"
Indexing "../html/images/"

Checking dir "../html/images"...
  swish.gif - Using DEFAULT (HTML2) parser -  (1 words)
  swish2.gif - Using DEFAULT (HTML2) parser -  (1 words)
  swish2b.gif - Using DEFAULT (HTML2) parser -  (1 words)
  swishbanner1.gif - Using DEFAULT (HTML2) parser -  (1 words)
  dotrule1.gif - Using DEFAULT (HTML2) parser -  (1 words)

Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 5 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
5 unique words indexed.
4 properties sorted.
5 files indexed.  23235 total bytes.  5 total words.
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!

Bill Moseley
Received on Thu Oct 17 13:06:55 2002