Michael Peters wrote:
> Am I right? If so, how can I fix this? When I use swish-e to index a
> filesystem with HTML docs that have UTF8 I use a FileFilter that changes
> UTF8 chars into HTML entities. Can I do something similar with the spider?
The answer, for anyone who comes after me, is "Yes!". The spider has an
option called output_function, which replaces the spider's normal printing with
your own function. It is not quite the same concept as a filter, because I had
to copy some of the existing spider code into my sub to make it work right. I
ended up with a sub like this:
    use Encode qw(decode_utf8);

    sub filter_output {
        my ($server, $content, $uri, $response, $bytecount, $path) = @_;

        # Decode the raw UTF-8 bytes, then turn every non-ASCII character
        # into a hexadecimal HTML entity
        $$content = decode_utf8($$content);
        $$content =~ s/([^\p{IsASCII}])/sprintf('&#x%X;', ord($1))/ge;

        # The content is pure ASCII now, so length() is also the byte count
        my $new_length = length($$content);

        print "Path-Name: $path\nContent-Length: $new_length\n";
        print "Charset: $server->{charset}\n" if $server->{charset};
        print "Last-Mtime: " . $response->last_modified . "\n"
            if $response->last_modified;

        # Set the parser type if specified by filtering
        if ( my $type = delete $server->{parser_type} ) {
            print "Document-Type: $type\n";
        } elsif ( $response->content_type =~ m!^text/(html|xml|plain)! ) {
            $type = $1 eq 'plain' ? 'txt' : $1;
            print "Document-Type: $type*\n";
        }

        print "No-Contents: 1\n" if $server->{no_contents};

        # A blank line ends the headers, as in the spider's own output code
        print "\n", $$content;
    }
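The decode-and-escape step can be tried on its own, outside the spider. This is only a minimal sketch: utf8_to_entities is a made-up helper name, not something spider.pl provides, and the character class [^\x00-\x7F] is used here as an equivalent of the non-ASCII class in the sub.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Hypothetical helper (not part of spider.pl): takes raw UTF-8 bytes and
# returns pure ASCII text with each non-ASCII character as a hex entity.
sub utf8_to_entities {
    my ($bytes) = @_;
    my $text = decode_utf8($bytes);    # UTF-8 bytes -> Perl characters
    $text =~ s/([^\x00-\x7F])/sprintf('&#x%X;', ord($1))/ge;
    return $text;
}

# "cafe" with an e-acute, given as UTF-8 bytes (C3 A9 encodes U+00E9)
print utf8_to_entities("caf\xC3\xA9"), "\n";    # prints caf&#xE9;
```

Plain ASCII input passes through unchanged, so the step should be safe to run on documents that contain no UTF8 at all.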
That seems to do everything I want it to.
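For completeness, hooking the sub into the spider would look roughly like this in the spider config file. Treat it as a sketch: the base_url and email values are placeholders, and the surrounding layout assumes the usual @servers list of hashes that spider.pl reads from its config.

```perl
# Spider config file (sketch only; base_url and email are placeholder values)
use Encode qw(decode_utf8);

@servers = (
    {
        base_url        => 'http://www.example.com/',
        email           => 'webmaster@example.com',
        # Call filter_output instead of the spider's normal printing
        output_function => \&filter_output,
    },
);

# The filter_output sub from this post goes here, in the same file.

1;
```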
--
Michael Peters
Plus Three, LP
Received on Tue Mar 24 13:49:49 2009