Bill Moseley wrote:
> The problem is if the content-length is reported in characters
> not bytes, then swish would end up reading in the wrong number of
> bytes.
>
> One would expect if the number was reported in characters that if
> anything the length would be too low. But you are seeing:
>
> Warning: Unknown header line: 'th-Name:
>
> which looks like the content-length header was reporting too many
> bytes. That's where I'm a bit confused.
It seems I copy-pasted the wrong error message. That one came from an experiment
where I modified the content-length header to see what was going on. The real
error was something like this (from memory):

Warning: Unknown header line: 'lumi/'

That was part of the URL for the Path-Name. But it still seems to be the same
problem: the reported content-length was too long.
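To make the character/byte distinction concrete, here is a minimal Perl sketch. The sample string is made up for illustration and is not taken from the spidered pages. It only shows that, for UTF-8, the character count can never exceed the octet count, so a length reported in characters would come out too short rather than too long.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample text (an assumption for this sketch): "caf\x{e9}" and
# "M\x{fc}nchen" each contain one character that takes two bytes in UTF-8.
my $content = "caf\x{e9} M\x{fc}nchen";

# length() on a character string counts characters.
my $chars = length($content);

# Encoding to UTF-8 first gives the number of octets that would be written out.
my $octets = length(encode('UTF-8', $content));

print "characters: $chars, octets: $octets\n";
```

With this sample the octet count is larger than the character count by two, one extra byte for each accented character.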
> The spider gets the content from LWP with:
>
> my $content = $response->decoded_content;
Would there be a problem if the site didn't send a correct charset in its
Content-Type header? Some of the pages I was spidering did not do that.
> So $content is characters at that point. That's what you want -- you
> want characters in your Perl program and octets on the outside.
Is it possible that something in my environment was messing it up? I was calling
swish-e via system() from a UTF8-ified system.
But regardless, I need to be able to show UTF-8 characters safely in the
descriptions of my search results, so I need to escape them into HTML entities.
And after changing the content I need to recalculate the content-length header
anyway, right?
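A rough sketch of that idea follows, assuming the content is already a Perl character string. The helper name escape_non_ascii and the sample text are hypothetical; this is not the spider's actual code. It turns non-ASCII characters into numeric entities and then computes the length from the encoded octets.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: replace every non-ASCII character with a numeric
# HTML entity. It deliberately leaves &, < and > alone, on the assumption
# that the input is already HTML.
sub escape_non_ascii {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf('&#%d;', ord($1))/ge;
    return $text;
}

my $content = "caf\x{e9} M\x{fc}nchen";    # hypothetical page text (characters)
my $escaped = escape_non_ascii($content);

# Count octets, not characters, for the header. After escaping, the text is
# pure ASCII, so both counts agree, but encoding first keeps the number right
# even if some non-ASCII characters are left unescaped.
my $content_length = length(encode('UTF-8', $escaped));

print "Content-Length: $content_length\n";
print "$escaped\n";
```

The encode_entities function from HTML::Entities could probably replace the hand-rolled regex, but by default it also escapes &, < and >, which may not be wanted for content that is already HTML.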
--
Michael Peters
Plus Three, LP
Received on Thu Mar 26 16:11:06 2009