Re: [swish-e] can't index with spider.pl

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Dec 06 2007 - 21:50:58 GMT
On Thu, Dec 06, 2007 at 07:46:36PM +0100, Louis-David Mitterrand wrote:
> 	http://trajan.apartia.fr/index.md:158: error: htmlParseEntityRef: expecting ';'

> 	Warning: Unknown header line: 'CTYPE html' from program spider.pl

Those are two different errors.  The first one means that libxml2
found an entity but it wasn't terminated by a ';'.

<a href="http://www.dessy.com/?go=dresses&style

That's not valid: a bare & inside an attribute value has to be written
as &amp;.  You need

<a href="http://www.dessy.com/?go=dresses&amp;style
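
If you can't fix every page by hand, something along these lines could
escape the stray ampersands before the HTML is served or indexed.  This
is only a rough sketch: escape_bare_ampersands() is a made-up helper
(not part of spider.pl or swish-e), and the example.com URL is a
placeholder.

```perl
use strict;
use warnings;

# Hypothetical helper: escape any '&' that does not already start an
# entity reference such as &amp; or &#38; or &#x26;.
sub escape_bare_ampersands {
    my ($text) = @_;
    $text =~ s/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)/&amp;/g;
    return $text;
}

# Placeholder URL, for illustration only.
my $href  = 'http://example.com/?a=1&b=2';
my $fixed = escape_bare_ampersands($href);
print "$fixed\n";
```

Ampersands that already begin a terminated entity are left alone, so
running it twice over the same text should not double-escape anything.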


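The "Unknown header" warning is due to the spider reporting the
incorrect length of a document to swish.  Swish then reads the wrong
number of bytes, and a stray piece of document text (here 'CTYPE html',
which looks like part of a DOCTYPE line) gets parsed as a header line.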

Now, this has been a problem -- and the spider has been adjusted a few
times for this.  So, it might depend on the version of the spider you
have or a problem with your document or the encoding reported by your
server.

Correct behavior should be:

    1) Fetch the document and its character encoding from the server.
    2) Decode that into Perl's internal encoding.
       Now length( $doc ) is characters, not bytes.
    3) Process the document: extract links, filter, etc.
    4) Encode the document back into its original encoding.
    5) Now length( $doc ) is the length in bytes.
    6) Pipe the encoded doc to swish, telling swish how many bytes
       to read.
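
As a rough sketch of those steps using the core Encode module (this is
not the actual spider.pl code): prepare_for_swish() is a hypothetical
helper, the URL is a placeholder, and the Path-Name / Content-Length
header names are written from memory of the -S prog input format, so
check them against the swish-e docs.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper, for illustration only.
sub prepare_for_swish {
    my ($raw_bytes, $charset, $path) = @_;

    # Step 2: bytes from the server -> Perl characters.
    my $doc = decode($charset, $raw_bytes);
    my $char_count = length $doc;          # characters, not bytes

    # Step 3: link extraction, filtering, etc. would happen here.

    # Step 4: back to the original encoding.
    my $out = encode($charset, $doc);

    # Step 5: length() on the encoded string is now a byte count.
    my $byte_count = length $out;

    # Step 6: headers (names assumed, see above), blank line, document.
    my $record = "Path-Name: $path\n"
               . "Content-Length: $byte_count\n\n"
               . $out;
    return ($record, $char_count, $byte_count);
}

# "cafe" with an e-acute, as UTF-8 bytes: 4 characters, 5 bytes.
my $raw = "caf\xC3\xA9";
my ($record, $chars, $bytes) =
    prepare_for_swish($raw, 'UTF-8', 'http://example.com/a.html');
print "chars=$chars bytes=$bytes\n";
```

If the character count were sent as Content-Length here, swish would
stop one byte early and treat the leftover byte as the start of the
next header.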

I think the spider uses:

 my $bytecount = length pack 'C0a*', $$content;

But that should be the same as length() on an encoded string, IIRC.
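
A quick way to see the character/byte difference without relying on the
pack idiom is to call length() on the string before and after encoding.
This is just an illustration with a made-up sample string, not code from
the spider:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Sample text: "naive cafe" with i-diaeresis and e-acute, 10 characters.
my $text = "na\x{EF}ve caf\x{E9}";

my $chars  = length($text);                        # character count
my $utf8   = length(encode('UTF-8', $text));       # bytes when sent as UTF-8
my $latin1 = length(encode('ISO-8859-1', $text));  # bytes when sent as Latin-1

print "chars=$chars utf8=$utf8 latin1=$latin1\n";
```

The byte count only matches the character count for single-byte
encodings, so the length passed to swish has to be taken after encoding
into whatever charset is actually being piped.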

I'm not sure how an incorrect encoding on the server might trigger
this (seems like the decode=>encode process would clean up the text).


Is it possible to put three documents that are linked together on your
web server (to make it easy to test)?  Then I can try spidering those
three from here and see if I get the same problem.





-- 
Bill Moseley
moseley@hank.org
Received on Thu Dec 6 16:51:07 2007