On Oct 27, 2010, at 2:18 PM, Bill Moseley wrote:
> I have not looked at that code in, well, years. Swish *should* be working with bytes, so my guess is that the spider is telling swish that the content is one byte longer than it really is.
>
> http://dev.swish-e.org/browser/swish-e/trunk/prog-bin/spider.pl.in#L1409
>
> # Re-encode the data for outside of Perl
> 1407 eval {
> 1408 # Need to only require Encode here?
> 1409 $$content = Encode::encode( $server->{charset}, $$content )
> 1410 if $server->{charset};
> 1411 };
> 1412 if ( $@ ) {
> 1413 print STDERR "Warning: document '", $response->request->uri, "' could not be encoded to charset '$server->{charset}'\n";
> 1414 delete $server->{charset};
> 1415 }
>
> $content should now be a reference to a string of bytes.
>
>
> 1416
> 1417 $server->{counts}{'Total Bytes'} += length $$content;
> 1418 $server->{counts}{'Total Docs'}++;
> 1419
> 1420
> 1421 # ugly and maybe expensive, but perhaps more portable than "use bytes"
> 1422 my $bytecount = length pack 'C0a*', $$content;
> 1423
>
> This is a wild guess, but what if you replace that with:
>
> my $bytecount = length $$content;
That did the trick!

Also, one of the places where the spider was failing was on the <, <<, >, and >> characters that were being used as navigation links between pages. I replaced those with words like "Prev Month" instead, and most of the errors went away. I still get quite a few errors where subject lines contain the & character.

The output still needs tweaking. I'm not sure how to remove the document path from the output; swishdocpath does not appear anywhere in the CGI conf file. I am also unable to remove 'swishdescription' from the front of the excerpt.
Example @ http://type2.com/cgi-bin/search.cgi?query=westy&submit=Search!&sort=swishrank&si=3
As a side note, this will also be the solution to another long-running thread about my attempt to get swish3 to handle this task.
http://swish-e.org/archive/2009-12/12787.html
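For anyone who finds this thread later, here is a minimal sketch of why a plain length call gives the byte count once the content has been through Encode::encode. It is only an illustration and is not taken from spider.pl: the sample string and variable names are made up, and UTF-8 is assumed as the charset.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical sample: a character string with one non-ASCII character.
my $chars = "caf\x{e9}";    # 4 characters

# Encode to a byte string, as the spider does before handing content to swish.
my $bytes = Encode::encode( 'UTF-8', $chars );

my $char_count = length $chars;    # counts characters
my $byte_count = length $bytes;    # counts bytes, because $bytes is a byte string

print "characters: $char_count\n";
print "bytes:      $byte_count\n";
```

The two counts differ here because the non-ASCII character takes two bytes in UTF-8, and the second count is the one swish needs to be told.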
Troy
_______________________________________________
Users mailing list
Users@lists.swish-e.org
http://lists.swish-e.org/listinfo/users
Received on Sun Oct 31 09:23:30 2010