On 09/12/2007 03:44 PM, Bill Moseley wrote:
> On Wed, Sep 12, 2007 at 03:40:04PM -0500, Peter Karman wrote:
>> So likely there's an issue with spider.pl and how it is calculating length()
>> for docs with unreliable encodings. That's my guess anyway. spider.pl could
>> probably be made smarter about sanity checking the docs for length and
>> encoding, and made to fail gracefully somehow. I know there's been talk here
>> lately about some of the encoding stuff it does.
> The spider just needs to *always* decode on input, then encode back to
> the original charset, and then use length() to report the length.
> That seems like the most simple and correct way to go. Seems right to
> you, Peter?

That seems right.

The only problem I see is similar to one we've hit before: docs that declare a
charset in the <head> but do not actually adhere to it. So how do you know what
the "original charset" really is?

Browsers seem to handle that OK in most cases, displaying the occasional utf-8
glyph in a document that claims to be iso-8859-1, and vice versa. But swish-e
won't handle that well.
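
To make the mismatch concrete, here is a small sketch in Python (not
spider.pl's Perl; the sample bytes and the decodes_as() helper are made up for
illustration). It suggests why the declared charset can't always be checked:
a strict utf-8 decode rejects stray iso-8859-1 bytes, but an iso-8859-1 decode
accepts any byte sequence, so utf-8 bytes in a "latin-1" doc pass silently.

```python
# Hypothetical sample bytes, not taken from any real crawl.
latin1_bytes = "caf\u00e9".encode("iso-8859-1")  # b'caf\xe9'
utf8_bytes = "caf\u00e9".encode("utf-8")         # b'caf\xc3\xa9'


def decodes_as(octets, charset):
    """Return True if octets decode strictly under the given charset."""
    try:
        octets.decode(charset, errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# iso-8859-1 bytes in a doc that claims utf-8: detectable.
print(decodes_as(latin1_bytes, "utf-8"))      # False

# utf-8 bytes in a doc that claims iso-8859-1: decodes without complaint,
# but the text is wrong (two characters where one was intended).
print(decodes_as(utf8_bytes, "iso-8859-1"))   # True
print(len(utf8_bytes.decode("iso-8859-1")))   # 5, not 4
```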

I guess that's not really spider.pl's problem though, as long as it gets
length() correct and doesn't accidentally double-encode something. If spider.pl
can't encode() back, for whatever reason, it ought to carp and skip that doc.
That's what I meant by graceful failure.
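
Roughly what I have in mind, sketched in Python rather than Perl, so treat it
as an illustration and not as what spider.pl does today. The checked_length()
name is hypothetical. In Perl the same idea would presumably go through the
Encode module's decode()/encode() with a croaking CHECK value, plus carp.

```python
import warnings


def checked_length(octets, charset):
    """Decode, re-encode, and return the byte length, or None to skip the doc.

    Assumption: `charset` is whatever the doc or HTTP header declared.
    """
    try:
        # Always decode on input, strictly, so bad bytes are not hidden.
        text = octets.decode(charset, errors="strict")
        # Encode back to the original charset and measure that.
        roundtrip = text.encode(charset, errors="strict")
    except (UnicodeError, LookupError) as err:
        # LookupError covers an unknown charset name.
        warnings.warn(f"skipping doc: cannot round-trip as {charset!r}: {err}")
        return None
    return len(roundtrip)


print(checked_length(b"caf\xc3\xa9", "utf-8"))  # 5
print(checked_length(b"caf\xe9", "utf-8"))      # None, with a warning
```

The length is taken from the re-encoded bytes, never from the decoded string,
so it counts bytes in the original charset and not characters.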

In The Future, it'd be nice for spider.pl to just standardize on utf-8 for
output to swish-e. But of course, that has to wait until swish-e can handle
utf-8. :)

Peter Karman . peter(at)not-real.peknet.com . http://peknet.com/
Received on Wed Sep 12 16:51:07 2007