[swish-e] problems with spidering UTF8

From: Michael Peters <mpeters(at)not-real.plusthree.com>
Date: Tue Mar 24 2009 - 16:22:44 GMT
I've been trying to track down some weird errors that we're getting from using 
spider.pl. It will be humming along ok and then I'll get an error like this:

Warning: Unknown header line: 'th-Name:

Swish-e can't recover, and I'm left with no index even though it had indexed lots 
of content before this. So it seems that the file output by the spider isn't 
completely correct. My guess is either that the Content-Length is off (unlikely, 
since that comes from the server and all the content does make it into the 
spider's output file) or that swish-e is encountering some multi-byte characters 
in the output and getting confused. Either way, it fails to find the correct end 
of the document and so misreads the headers of the next document.
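
To make the hypothesis concrete, here is a minimal Python sketch of how a length mismatch could produce a truncated header like the one in the warning. It is an illustration only, not a diagnosis: the record layout is a simplified assumption (just Path-Name and Content-Length, then a blank line, then the body), and make_record and parse are hypothetical helpers, not code from spider.pl or swish-e. The mismatch is simulated by declaring the UTF-8 byte count while writing the body as Latin-1, so the declared length is two bytes too long.

```python
# Hypothetical illustration; simplified record layout, not spider.pl's real code.
KNOWN_HEADERS = ("Path-Name", "Content-Length")


def make_record(path: str, body: bytes, declared_length: int) -> bytes:
    """Build one record: headers, a blank line, then the body bytes."""
    header = f"Path-Name: {path}\nContent-Length: {declared_length}\n\n"
    return header.encode("ascii") + body


def parse(data: bytes):
    """Read records back, trusting Content-Length to find each body's end."""
    pos = 0
    docs = []
    while pos < len(data):
        headers = {}
        while True:
            end = data.index(b"\n", pos)
            line = data[pos:end].decode("ascii", "replace")
            pos = end + 1
            if line == "":  # blank line ends the header block
                break
            name, _, value = line.partition(":")
            if name not in KNOWN_HEADERS:
                raise ValueError(f"Unknown header line: '{line}'")
            headers[name] = value.strip()
        length = int(headers["Content-Length"])
        docs.append((headers["Path-Name"], data[pos:pos + length]))
        pos += length  # a wrong length lands mid-header in the next record
    return docs


text = "naïve café\n"             # 11 characters, 13 bytes as UTF-8
utf8_len = len(text.encode("utf-8"))

# Declared length matches the bytes written: both records parse.
good = make_record("/a.html", text.encode("utf-8"), utf8_len)
good += make_record("/b.html", b"ok\n", 3)
print([path for path, _ in parse(good)])

# Declared length is 2 bytes larger than what was written: the parser
# swallows "Pa" from the next record's "Path-Name:" header.
bad = make_record("/a.html", text.encode("latin-1"), utf8_len)
bad += make_record("/b.html", b"ok\n", 3)
try:
    parse(bad)
except ValueError as err:
    print(err)  # Unknown header line: 'th-Name: /b.html'
```

In this sketch the parser overshoots by exactly the number of extra bytes declared. Whether the real output file has the same kind of mismatch would need to be checked by comparing each Content-Length value against the actual byte count of the body that follows it.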

Am I right? If so, how can I fix this? When I use swish-e to index a filesystem 
of HTML docs that contain UTF8, I use a FileFilter that changes UTF8 chars into 
HTML entities. Can I do something similar with the spider?
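
For reference, the entity conversion described above can be sketched in a few lines of Python. This is only an illustration of the transformation, under the assumption that the input has already been decoded to a Unicode string; to_entities is a hypothetical name, and how such a step would be wired into spider.pl is not shown here.

```python
def to_entities(html_text: str) -> bytes:
    """Replace every non-ASCII character with a numeric character reference.

    The result is pure ASCII, so its byte count and its character count
    are the same number.
    """
    return html_text.encode("ascii", "xmlcharrefreplace")


converted = to_entities("naïve café")
print(converted)       # b'na&#239;ve caf&#233;'
print(len(converted))  # byte length equals character length after conversion
```

Because the converted text is ASCII only, a length computed in characters and one computed in bytes can no longer disagree, which is why this kind of filter sidesteps multi-byte problems.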

-- 
Michael Peters
Plus Three, LP

Received on Tue Mar 24 12:25:13 2009