
Re: [swish-e] can't index with

From: Bill Moseley <moseley(at)>
Date: Thu Dec 06 2007 - 21:50:58 GMT
On Thu, Dec 06, 2007 at 07:46:36PM +0100, Louis-David Mitterrand wrote:
> error: htmlParseEntityRef: expecting ';'

> 	Warning: Unknown header line: 'CTYPE html' from program

Those are two different errors.  The first one means that libxml2
found an entity but it wasn't terminated by a ';'.

<a href="

That's not valid.  A bare '&' in an attribute value has to be written
as '&amp;'.  You need

<a href=";style
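
If you want to check your own pages for this, one rough approach is to
escape every '&' that does not already start a complete entity
reference.  A minimal sketch in Perl; the sub name and the sample link
are made up for illustration, and the regex is only a heuristic, not a
real HTML parser:

```perl
use strict;
use warnings;

# Hypothetical helper: replace each '&' that is not already the start of
# a complete entity reference (named, decimal, or hex) with '&amp;'.
sub escape_bare_ampersands {
    my ($text) = @_;
    $text =~ s/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)/&amp;/g;
    return $text;
}

# Made-up example link, not taken from the original report.
my $href = 'page.html?id=1&style=plain';
print escape_bare_ampersands($href), "\n";
```

Text that is already escaped (for example '&amp;' or '&#38;') is left
alone, so running it twice should not double-escape anything.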

The "Unknown header" is due to the spider reporting the incorrect
length of a document to swish.

Now, this has been a problem -- and the spider has been adjusted a few
times for this.  So, it might depend on the version of the spider you
have, a problem with your document, or the encoding reported by your
server.

Correct behavior should be:

    1) fetch document and character encoding from server
    2) decode that into Perl's internal encoding.
       Now length( $doc ) is characters, not bytes.
    3) Process the document, extract links, filter, etc.
    4) encode the document back into its original encoding.
    5) Now length( $doc ) is the length in bytes.
    6) pipe the encoded doc to swish telling swish how many bytes
       to read.
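
The steps above can be sketched roughly as follows, using the core
Encode module.  This is not the spider's actual code: the sub name
prepare_for_swish, the placeholder processing step, and the sample
document are all assumptions for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper, not part of the spider: takes the raw bytes and
# the charset reported by the server, and returns the re-encoded bytes
# together with their length in bytes.
sub prepare_for_swish {
    my ($raw, $charset) = @_;

    # 2) decode into Perl's internal representation;
    #    length() now counts characters
    my $doc = decode($charset, $raw);

    # 3) process the document here (extract links, filter, ...)

    # 4) encode back into the original encoding
    my $bytes = encode($charset, $doc);

    # 5) length() on the encoded string counts bytes
    return ($bytes, length $bytes);
}

# Made-up sample: "cafe" with an accented e is 4 characters
# but 5 bytes in UTF-8.
my $raw = "caf\xC3\xA9";
my ($bytes, $count) = prepare_for_swish($raw, 'UTF-8');
printf "characters: %d, bytes: %d\n",
    length(decode('UTF-8', $raw)), $count;
```

The count from step 5, not the character count from step 2, is the one
that has to be reported to swish.  The two only agree when the document
is pure ASCII, which would explain why the problem shows up on some
documents and not others.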

I think the spider uses:

 my $bytecount = length pack 'C0a*', $$content;

But that should be the same as length() on an encoded string, IIRC.

I'm not sure how an incorrect encoding on the server might trigger
this (seems like the decode=>encode process would clean up the text).

Is it possible to put three documents that are linked together on your
web server (to make it easy to test)?  Then I can try spidering those
three from here and see if I get the same problem.

Bill Moseley

Received on Thu Dec 6 16:51:07 2007