Re: Another HTML entities query

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Fri Jan 05 2007 - 04:36:48 GMT
max thom stahl scribbled on 1/4/07 5:06 PM:
> Ok . . . last month I asked about HTML entities and didn't really have a 
> good chance to tweak about with things. What's going on is that the 
> spider is definitely pulling down metadata from my site with entities 
> like &mdash; and &rsquo; and whatnot unencoded, which means it's UTF-8?
> 

Those entities resolve to code points (&mdash; is U+2014, &rsquo; is U+2019) that can be represented in UTF-8, yes.
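
To make that concrete, here is a small sketch using only core Perl (utf8_hex is just a name I made up for the example). It prints the UTF-8 bytes for the em dash code point, then shows what happens if those bytes are read as Latin-1 instead. I am guessing that this kind of misreading is behind the `A' with a box, but I have not seen your output.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

# utf8_hex() is a hypothetical helper for this example only.
# It returns the UTF-8 encoding of a character string as hex bytes.
sub utf8_hex {
    my ($chars) = @_;
    my $bytes = encode( 'UTF-8', $chars );
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

my $mdash = chr(0x2014);    # the character that &mdash; stands for
print 'UTF-8 bytes: ', utf8_hex($mdash), "\n";

# The same bytes decoded as Latin-1 become three separate characters.
my $misread = decode( 'ISO-8859-1', encode( 'UTF-8', $mdash ) );
printf "read as Latin-1: %d characters, first is U+%04X\n",
    length($misread), ord($misread);
```

The first of those three characters is U+00E2 (an a with a circumflex) and the other two are control characters, which many terminals and fonts draw as boxes.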


> In spider.pl, I should be able to find a spot to make a call to 
> HTML::Entities::encode_entities to make it so that what gets output to 
> Swish-e  has those entities encoded, right? What I'm getting now is em 
> dashes are, instead of &mdash;, some bizarre-looking character that 
> looks like an `A' with a box around it. Same story with right single 
> quotes, too. . . .
> 
> Is there some way I can do this?
> 

You could pipe the output of spider.pl through another filter before passing it
to swish-e.

  % spider.pl | yourfilter | swish-e -S prog -i stdin

I suggest using something like HTML::Entities or Search::Tools::XML to write 
yourfilter.
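
If you want the entities back in the output, as you asked, a rough sketch with HTML::Entities could look like the following. entify is a made-up name, and I am assuming the text arrives as UTF-8 bytes. The second argument to encode_entities limits the encoding to characters outside printable ASCII, so the markup itself is left alone.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(encode_entities);

# entify() is a hypothetical helper name for this sketch.
# It takes UTF-8 bytes and returns plain ASCII with entities.
sub entify {
    my ($bytes) = @_;
    my $chars = decode( 'UTF-8', $bytes );

    # Encode everything except tab, newline, CR and printable ASCII.
    # Markup characters such as < > & are printable ASCII, so they
    # pass through untouched.
    return encode_entities( $chars, '^\t\n\r\x20-\x7e' );
}

# Small demonstration on a hard-coded sample of UTF-8 bytes.
print entify("caf\xc3\xa9 \xe2\x80\x94 it\xe2\x80\x99s\n");
```

To turn this into yourfilter, call entify on each line read from STDIN, for example with: print entify($_) while <STDIN>;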

If you use Search::Tools, you can also use Search::Tools::Transliterate to then 
convert your UTF-8 multi-byte characters to their single-byte equivalents, which 
swish-e can deal with.

Something like:

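#!/usr/bin/perl

use strict;
use warnings;
use Search::Tools::XML;
use Search::Tools::Transliterate;

my $xml   = Search::Tools::XML->new;
my $trans = Search::Tools::Transliterate->new;

# Read the spider.pl output line by line, turn the entities into
# characters, then transliterate the resulting UTF-8.
while (<>)
{
     print $trans->convert( $xml->unescape( $_ ) );
}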

# end
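
One thing to watch with any filter in this position: as far as I know, the -S prog format gives each document a Content-Length header holding the byte count of the body, so a filter that changes the length (a 3-byte em dash becoming the 7-byte &mdash;, say) should probably rewrite that header too. Below is a rough sketch of that idea. filter_prog_stream is a hypothetical helper, not part of swish-e or spider.pl, and it assumes the transform takes and returns bytes and that header order does not matter.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: copy a prog-style stream from $in to $out,
# applying $transform to each document body and recomputing
# Content-Length. $transform must take and return a byte string.
sub filter_prog_stream {
    my ( $in, $out, $transform ) = @_;
    local $/ = "\n";

    while ( defined( my $line = <$in> ) ) {
        my @headers;
        my $length;

        # Collect header lines up to the blank separator line.
        while ( defined $line && $line !~ /^\r?\n\z/ ) {
            if ( $line =~ /^Content-Length:\s*(\d+)/i ) {
                $length = $1;
            }
            else {
                push @headers, $line;
            }
            $line = <$in>;
        }
        die "missing Content-Length header\n" unless defined $length;

        # Read exactly the advertised number of bytes as the body.
        my $body = '';
        my $got = read( $in, $body, $length );
        die "short read on document body\n"
            unless defined $got && $got == $length;

        $body = $transform->($body);

        # Emit the other headers, then the corrected length and the body.
        print {$out} @headers, 'Content-Length: ', length($body), "\n\n",
            $body;
    }
    return;
}
```

In a pipeline it would be called as filter_prog_stream( \*STDIN, \*STDOUT, $transform ), where $transform is a code reference that does the entity or transliteration work on one document.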




-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Thu Jan 4 20:36:53 2007