Re: [swish-e] problems with spidering UTF8

From: Michael Peters <mpeters(at)not-real.plusthree.com>
Date: Thu Mar 26 2009 - 20:08:35 GMT
Bill Moseley wrote:

> The problem is if the content-length is reported in characters
> not bytes, then swish would end up reading in the wrong number of
> bytes.
> 
> One would expect if the number was reported in characters that if
> anything the length would be too low.  But you are seeing:
> 
>     Warning: Unknown header line: 'th-Name:
> 
> which looks like the content-length header was reporting too many
> bytes.  That's where I'm a bit confused.

It seems I copy-pasted the wrong error message. That one came from an experiment 
where I modified the content-length header to see what was going on. The 
real error was something like this (from memory now):

   Warning: Unknown header line: 'lumi/'

Which was part of the URL for the Path-Name. But it seems like it's still the 
same problem: it was reporting a content-length that was too long.
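To illustrate why a content-length that is too long would produce a warning like that, here is a minimal sketch. It assumes a simplified reader that consumes header lines up to a blank line and then skips exactly Content-Length bytes of body. This is not swish-e's actual parser; `record`, `read_records` and the example.com URLs are made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: one document as headers, a blank line, then the body.
# $declared is the Content-Length we claim, which may differ from the truth.
sub record {
    my ($path, $body, $declared) = @_;
    return "Path-Name: $path\nContent-Length: $declared\n\n$body";
}

# Simplified reader (an assumption, not swish-e's real code): read header
# lines until a blank line, then skip Content-Length bytes of body.
sub read_records {
    my ($stream) = @_;
    my ($pos, @warnings) = (0);
    while ($pos < length $stream) {
        my $body_length = 0;
        while ($pos < length $stream) {
            my $eol = index($stream, "\n", $pos);
            $eol = length $stream if $eol < 0;
            my $line = substr($stream, $pos, $eol - $pos);
            $pos = $eol + 1;
            last if $line eq '';    # blank line ends the headers
            if ($line =~ /^Content-Length:\s*(\d+)$/) {
                $body_length = $1;
            }
            elsif ($line !~ /^[\w-]+:\s/) {
                push @warnings, "Unknown header line: '$line'";
            }
        }
        $pos += $body_length;       # trust the declared length
    }
    return @warnings;
}

# Overshoot by exactly the start of the next record's Path-Name line.
my $overshoot = length 'Path-Name: http://example.com/';
my $second    = record('http://example.com/lumi/', 'ok', 2);

my $good = record('http://example.com/a/', 'hello', 5) . $second;
my $bad  = record('http://example.com/a/', 'hello', 5 + $overshoot) . $second;

print "$_\n" for read_records($bad);
```

With the correct length the reader lands on the start of the next record. With the inflated length it lands in the middle of the next Path-Name line and treats the remainder as a header.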

> The spider gets the content from LWP with:
> 
>         my $content = $response->decoded_content;

Would there be a problem if the site didn't correctly send a Charset header? 
Some of the pages I was spidering did not do that.

> So $content is characters at that point.  That's what you want -- you
> want characters in your Perl program and octets on the outside.

Is it possible that something in my environment was messing it up? I was calling 
swish-e via system() from a UTF8-ified system.

But regardless, I need to be able to show UTF-8 characters safely in the 
descriptions of my search results, so escaping those into HTML entities is 
needed for that. And after changing the content I need to recalculate the 
content-length header anyway, right?
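A minimal sketch of that recalculation, assuming the content is a Perl character string (as returned by decoded_content) and that only non-ASCII characters need escaping. The helper names `escape_non_ascii`, `byte_length` and `build_record` are hypothetical and not part of the spider; the point is only that the length is taken from the encoded octets, not from the character string.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: replace each non-ASCII character with a numeric
# HTML entity and leave ASCII characters untouched.
sub escape_non_ascii {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf('&#%d;', ord($1))/ge;
    return $text;
}

# Number of octets a character string occupies once encoded as UTF-8.
sub byte_length {
    my ($text) = @_;
    return length(encode('UTF-8', $text));
}

# Hypothetical helper: headers plus body for one document, with
# Content-Length counted in bytes after the content has been changed.
sub build_record {
    my ($path, $content) = @_;
    my $octets = encode('UTF-8', escape_non_ascii($content));
    return sprintf("Path-Name: %s\nContent-Length: %d\n\n%s",
                   $path, length($octets), $octets);
}

my $text = "caf\x{e9}";    # 4 characters, 5 bytes as UTF-8
printf "characters=%d bytes=%d\n", length($text), byte_length($text);
print build_record('http://example.com/', $text), "\n";
```

Once everything is escaped to ASCII the character count and the byte count agree, but computing the length from the encoded octets stays correct even if some non-ASCII text is left in place.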

-- 
Michael Peters
Plus Three, LP

_______________________________________________
Users mailing list
Users@lists.swish-e.org
http://lists.swish-e.org/listinfo/users
Received on Thu Mar 26 16:11:06 2009