On Wed, Sep 12, 2007 at 03:51:08PM -0500, Peter Karman wrote:
> Only problem I see is similar to one we've hit before: docs that have charset
> declared in the <head> but which do not actually adhere to that charset. So how
> do you know what the "original charset" really is?
I'm thinking that should not be a problem (he says without trying),
because if I can get the content decoded into Perl, then encoding it
back to that charset should not be a problem as far as Perl is
concerned. It may not be byte-for-byte the same doc as went in, but it
should be valid characters. The encode and decode methods take a CHECK
value that says what to do with bad characters.
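As a rough sketch of that round trip (the sample bytes here are just made
up for illustration; Encode::FB_DEFAULT is the CHECK value that substitutes
a replacement character for anything bad instead of croaking):

```perl
use strict;
use warnings;
use Encode ();

# Bytes claiming to be ISO-8859-1: "café" with a Latin-1 e-acute.
my $octets = "caf\xE9";

# Decode into Perl's internal character string; FB_DEFAULT replaces
# any invalid byte sequence with U+FFFD rather than dying.
my $chars = Encode::decode('ISO-8859-1', $octets, Encode::FB_DEFAULT);

# Encode back out -- either to the original charset or to UTF-8.
my $latin1 = Encode::encode('ISO-8859-1', $chars, Encode::FB_DEFAULT);
my $utf8   = Encode::encode('UTF-8',      $chars, Encode::FB_DEFAULT);
```

So even when the declared charset doesn't quite match the bytes, the
output is still valid in the target encoding, just possibly with
substitution characters where the bad bytes were.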
What I'm not so clear on is how libxml2 will detect the correct charset.
I think it looks at a <meta> tag, but not all docs will have one.
I wonder if it would be smart to try and parse the document and
replace the charset declaration. How that would be done differs for
XML and HTML docs, and plain text is something else again.
Then there's the issue of filters.
Do you remember what libxml2 does if it doesn't find an explicit
charset declaration?
> In The Future, it'd be nice for spider.pl to just standardize on utf-8 for
> output to swish-e. But of course, that's when swish-e can handle utf-8. :)
It could do that now, since libxml2 reads UTF-8 fine. Swish, in
parser.c, converts to 8859-1 when reading the output from libxml2.
Again, it would just need to remove (or replace) any charset meta tags.
And then I'm not sure what to do with filtered content. But it's the
filter's responsibility to get the content correctly decoded.
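For the meta-tag replacement, a minimal sketch might look like the
following (the regex is deliberately naive and the sample markup is
invented; a real pass would want an actual HTML parser):

```perl
use strict;
use warnings;

# Invented example document with a stale charset declaration.
my $html = '<head><meta http-equiv="Content-Type"'
         . ' content="text/html; charset=iso-8859-1"></head>';

# Naive rewrite: point any charset= value at UTF-8, since that is
# what we would actually be emitting to swish-e.
$html =~ s{charset=[\w.-]+}{charset=UTF-8}gi;
```

That only handles the HTML case; an XML doc would need its
encoding="..." attribute in the declaration rewritten instead.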
Received on Wed Sep 12 17:18:29 2007