At 09:46 AM 09/26/02 -0700, Matt Kynaston wrote:
>I've narrowed the problem down to the "No-Content: 1" header - if I remove:
> <meta name="robots" content="nocontents">
>from the first file being spidered (listed below), swish-e is happy.
Yep, it was failing to flush the buffer.

It broke in the last update, when I made HTML2 the default parser if no
parser is specified and libxml2 is linked in. The code that deals with
NoContents only checked for an explicit HTML2 doctype, so in the
unspecified case it didn't know to flush the input buffer.
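
To illustrate why an unflushed body matters, here is a minimal sketch in
C. It is not swish-e's code: the names doc_header, read_header,
drain_body and count_docs are hypothetical, and it assumes a stream of
documents that each start with Path-Name:, Content-Length: and
(optionally) No-Contents: header lines, followed by a blank line and the
body. If the body of a No-Contents document is left in the stream, the
next header read starts in the middle of that body.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-document header record; not swish-e's real struct. */
struct doc_header {
    char path[256];
    long content_length;
    int  no_contents;
};

/* Read header lines up to the blank line that ends the header block.
 * Returns 1 if a header block with a Path-Name was read, 0 otherwise. */
static int read_header(FILE *in, struct doc_header *h)
{
    char line[512];
    int seen = 0;

    memset(h, 0, sizeof *h);
    while (fgets(line, sizeof line, in)) {
        if (line[0] == '\n')
            return seen && h->path[0] != '\0';
        seen = 1;
        if (sscanf(line, "Path-Name: %255[^\n]", h->path) == 1)
            continue;
        if (sscanf(line, "Content-Length: %ld", &h->content_length) == 1)
            continue;
        if (sscanf(line, "No-Contents: %d", &h->no_contents) == 1)
            continue;
        /* Unknown header lines are ignored in this sketch. */
    }
    return 0;
}

/* Read and discard exactly n bytes so the stream is positioned at the
 * next document's headers. Returns 1 on success, 0 on a short read. */
static int drain_body(FILE *in, long n)
{
    char buf[4096];

    while (n > 0) {
        size_t want = n < (long)sizeof buf ? (size_t)n : sizeof buf;
        size_t got = fread(buf, 1, want, in);
        if (got == 0)
            return 0;
        n -= (long)got;
    }
    return 1;
}

/* Count the documents whose headers parse cleanly. When drain_skipped
 * is 0, the body of a No-Contents document is left unread, which
 * mimics the failure described above. A real indexer would parse the
 * body of the other documents; this sketch simply discards it. */
int count_docs(FILE *in, int drain_skipped)
{
    struct doc_header h;
    int docs = 0;

    while (read_header(in, &h)) {
        assert(h.content_length >= 0);
        docs++;
        if (h.no_contents && !drain_skipped)
            continue;               /* body left sitting in the stream */
        if (!drain_body(in, h.content_length))
            break;
    }
    return docs;
}
```

Given a two-document stream where the first document carries
"No-Contents: 1" and its body contains a blank line, count_docs(in, 1)
sees both documents, while count_docs(in, 0) stops after the first
because the leftover body is misread as a header block.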

I always wonder how useful NoContents is. Is indexing the path name very
helpful when there's no <title>?

We will try to get a Windows binary out soon with this patch. It will be
in 2.2.1.

For the short term you can edit spider.pl. Look for:

    $headers .= "No-Contents: 1\n" if $server->{no_contents};
    print "$headers\n$$content";

and maybe try commenting out the No-Contents: header:

    #$headers .= "No-Contents: 1\n" if $server->{no_contents};
    print "$headers\n$$content";

But that will index the doc's contents.

Or, if you want to emulate the process, add this code a bit higher up
than the above:

    $server->{counts}{'Total Docs'}++;

    # add this code
    if ( $server->{no_contents} ) {
        my $title = $response->title || '';
        $$content = "<title>$title</title>";
    }

That last code is probably better than setting No-Contents and sending the
entire doc just to be thrown away. Oh well.
Index: src/index.c
===================================================================
RCS file: /cvsroot/swishe/swish-e/src/index.c,v
retrieving revision 1.198.2.1
diff -u -r1.198.2.1 index.c
--- src/index.c 24 Sep 2002 20:41:33 -0000 1.198.2.1
+++ src/index.c 26 Sep 2002 18:00:00 -0000
@@ -479,7 +479,7 @@
 #ifdef HAVE_LIBXML2
-    if (fprop->doctype == HTML2)
+    if (fprop->doctype == HTML2 || !fprop->doctype)
         return parse_HTML( sw, fprop, fi, buffer );
 #endif
--
Bill Moseley
mailto:moseley@hank.org
Received on Thu Sep 26 18:50:50 2002