On Thu, Dec 06, 2007 at 07:46:36PM +0100, Louis-David Mitterrand wrote:
> http://trajan.apartia.fr/index.md:158: error: htmlParseEntityRef: expecting ';'
> Warning: Unknown header line: 'CTYPE html' from program spider.pl
Those are two different errors. The first one means that libxml2
found an entity but it wasn't terminated by a ';'.
<a href="http://www.dessy.com/?go=dresses&style
That's not valid. You need to escape the ampersand as &amp;:
<a href="http://www.dessy.com/?go=dresses&amp;style
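As a rough illustration of that fix: a bare '&' inside an href has to be written as '&amp;', while an '&' that already starts an entity or character reference must be left alone. The helper name and the example.com URL below are made up for the sketch; this is not code from the spider, and fixing the links at the source is still the better option.

```perl
use strict;
use warnings;

# Hypothetical helper: escape every '&' that does not already begin
# a named entity (&amp;), a decimal reference (&#233;) or a
# hexadecimal reference (&#xE9;).
sub escape_bare_ampersands {
    my ($text) = @_;
    $text =~ s/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)/&amp;/g;
    return $text;
}

print escape_bare_ampersands('http://www.example.com/?go=dresses&style=long'), "\n";
# prints http://www.example.com/?go=dresses&amp;style=long
```

Running it a second time over the same text changes nothing, since '&amp;' is already a complete entity.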
The "Unknown header" is due to the spider reporting the incorrect
length of a document to swish.
Now, this has been a problem -- and the spider has been adjusted a few
times for this. So it might depend on the version of the spider you
have, or it could be a problem with your document or with the encoding
reported by your server.
Correct behavior should be:
1) Fetch the document and its character encoding from the server.
2) Decode that into perl's internal encoding.
   Now length( $doc ) is characters, not bytes.
3) Process the document, extract links, filter, etc.
4) Encode the document back into its original encoding.
5) Now length( $doc ) is the length in bytes.
6) Pipe the encoded doc to swish, telling swish how many bytes
   to read.
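The steps above can be sketched with the core Encode module. This is only an illustration under assumptions, not the spider's actual code: the function name swish_record is hypothetical, and the Path-Name and Content-Length header names are how I recall the swish-e "prog" input format, so check them against the docs for your version.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes the URL, the raw bytes fetched from the
# server and the charset the server reported (step 1), and returns
# the record to pipe to swish.
sub swish_record {
    my ( $url, $raw_bytes, $charset ) = @_;

    # 2) Decode into Perl's internal character representation.
    #    length( $doc ) now counts characters, not bytes.
    my $doc = decode( $charset, $raw_bytes );

    # 3) Process the document here (extract links, filter, etc.).

    # 4) Encode back into the original encoding.
    my $encoded = encode( $charset, $doc );

    # 5) length() on the encoded string counts bytes.
    my $bytecount = length $encoded;

    # 6) Headers, a blank line, then exactly $bytecount bytes.
    #    (Header names assumed from the swish-e prog format.)
    return "Path-Name: $url\nContent-Length: $bytecount\n\n" . $encoded;
}

my $raw = "caf\xC3\xA9";    # "cafe" with e-acute, as UTF-8 bytes
printf "characters: %d, bytes: %d\n",
    length( decode( 'UTF-8', $raw ) ), length($raw);
# prints characters: 4, bytes: 5

binmode STDOUT;             # write the record as raw bytes
print swish_record( 'http://example.com/', $raw, 'UTF-8' );
```

If Content-Length were taken from the decoded string here, it would say 4 instead of 5, and swish would try to read the leftover byte as the start of the next header block.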
I think the spider uses:
    my $bytecount = length pack 'C0a*', $$content;
But that should be the same as length() on an encoded string, IIRC.
I'm not sure how an incorrect encoding on the server might trigger
this (seems like the decode=>encode process would clean up the text).
Is it possible to put three documents that are linked together on your
web server (to make it easy to test)? Then I can try spidering those
three from here and see if I get the same problem.
--
Bill Moseley
moseley@hank.org
Received on Thu Dec 6 16:51:07 2007