On 09/12/2007 01:15 PM, Antonio Barrera wrote:
> Thanks for your help. I used the debugging available in spider.pl to
> eliminate a bunch of problematic files. The only issue, of course, is
> that I have eliminated files which, at least from a web browser,
> appear fine. I have yet to figure out what is wrong with these files,
> but at least I do have a working index.
If you are using the libxml2 parsers, I would suggest caching a few of those
troublesome pages and running them through xmllint. That will flag encoding
issues, bad tagging, etc. xmllint uses the same parsers (well, mostly) that
swish-e does, so you'd get a decent idea of the problems.
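For example (the URL and filename here are made up), you might cache one page and lint it like this:

```shell
# Cache one of the troublesome pages (hypothetical URL), then run it
# through xmllint's HTML parser; errors go to stderr with line numbers.
curl -s -o troublesome.html 'http://example.com/troublesome.html'
xmllint --html --noout troublesome.html

# If the pages claim to be XHTML, parse them strictly as XML instead:
xmllint --noout troublesome.html
```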
> Anyway, other than having spider specifically block troublesome files,
> is there a switch available to either program which will skip a file
> that can’t be indexed, but continue the indexing process?
It would have to be in spider.pl, since by the time swish-e knows there's a bad
doc, it has no nice way of recovering. swish-e is just reading N bytes at a
time using -S prog, so if spider.pl tells it N bytes, and it's really X, then
swish-e is irreparably lost.
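For context, a -S prog source just writes a small header block and then the raw
bytes for each document. A minimal sketch (the path is made up; Path-Name and
Content-Length are the headers swish-e keys on):

```shell
# Minimal sketch of what a -S prog program (like spider.pl) emits per
# document. swish-e reads exactly Content-Length bytes after the blank
# line, so if this number is wrong the stream desynchronizes and every
# document after it is misread.
doc='<html><body>hello</body></html>'
len=$(printf '%s' "$doc" | wc -c)
printf 'Path-Name: /tmp/example.html\n'
printf 'Content-Length: %d\n' "$len"
printf '\n'
printf '%s' "$doc"
```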
So likely there's an issue with spider.pl and how it is calculating length()
for docs with unreliable encodings. That's my guess anyway. spider.pl could
probably be made smarter about sanity checking the docs for length and
encoding, and made to fail gracefully somehow. I know there's been talk here
lately about some of the encoding stuff it does.
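As a quick illustration of the kind of mismatch that could cause this (assuming
the bug really is character-vs-byte counting): Perl's length() on a decoded
string counts characters, while swish-e needs the byte count.

```shell
# Character count vs byte count for the same UTF-8 text. 'café' is
# 4 characters but 5 bytes; a Content-Length computed from the
# character count would leave swish-e one byte short per document.
printf 'café' | wc -c     # bytes: 5
printf 'café' | wc -m     # characters: 4 in a UTF-8 locale (locale-dependent)
```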
Peter Karman . peter(at)not-real.peknet.com . http://peknet.com/
Users mailing list
Received on Wed Sep 12 16:40:04 2007