Re: [swish-e] Unknown header line

From: Clint
Date: Mon, 10 Oct 2011 14:57:27 +0200
Hi,

Thanks for the response Erik.

I was able to track the problem down to a page with Russian content.
Both the page itself and its content are encoded as UTF-8.
I'm still not sure why it crashes swish-e, but when I remove the page,
the indexing works as it should.
At least I know what to do now, so I can update the search db in
the meantime.

Clint

On 2011/10/07 04:01 PM, Erik Guss wrote:
> I have run into similar issues that have to do with encoding of
> characters in that or the previous document (throwing the content length
> off? I don't really understand it). For us, it seems to center around 4
> or 5 encoded characters that content editors cut and pasted into the
> web pages. They are: 
>
> apostrophe (U+2019), left double quote (U+201C), right double quote
> (U+201D), em dash (U+2014), en dash (U+2013), horizontal ellipsis
> (U+2026).
>
> We've used several ways to find and replace these with their Latin-1
> equivalents. You can find them by turning the verbosity of your indexing
> up to 3 and seeing which document indexing stops at, or by using grep in
> a shell script, something like:
>
>   find "$dir" -name '*.html' -exec grep -P '\xE2\x80\x9D' {} /dev/null ";" \
>     -print >> /src/utf8/u201d.txt
>
> Then use Notepad++ or another editor that supports UTF-8 to find it in
> the file and replace it.
>
> It is a somewhat manual process but is the only way I've found to clean
> them up.
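
A minimal sketch of the search step described above, written in Python rather than shell. It reports where any of the six characters listed in this message occur in a piece of text. The names `SUSPECTS` and `find_suspects` are made up for illustration, and the text is assumed to have been decoded as UTF-8 already.

```python
# Code points taken from the list in this message; descriptions are informal.
SUSPECTS = {
    "\u2019": "apostrophe (right single quote)",
    "\u201c": "left double quote",
    "\u201d": "right double quote",
    "\u2014": "em dash",
    "\u2013": "en dash",
    "\u2026": "horizontal ellipsis",
}


def find_suspects(text):
    """Yield (line_number, column, description) for each suspect character."""
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in SUSPECTS:
                yield lineno, col, SUSPECTS[ch]


# Hypothetical sample text, not taken from any real page.
sample = "It\u2019s fine\nplain line\n\u201cquoted\u201d"
for hit in find_suspects(sample):
    print(hit)
```

To check a file, the text could be read with something like `open(path, encoding="utf-8").read()` and passed to `find_suspects`; the line and column numbers then make the manual replacement in an editor quicker.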
>
> I hope it helps.
>
> Erik Guss
> Montana State University Library
>
> On Fri, 2011-10-07 at 15:30 +0200, Clint wrote:
>> Hi,
>>
>> Swish-e no longer wants to update the index after having run spider.pl.
>> It ran perfectly for more than two years now, but has started to abort
>> and spew out an error I initially had when I started using it.
>>
>> On Linux.
>>
>> I first run this on the command line:
>> /usr/local/lib/swish-e/spider.pl > output.txt
>>
>> and then
>> swish-e -c swish.conf -S prog -i stdin < output.txt
>>
>> but this aborts after a while with
>>
>> Warning: Unknown header line: 'om/linking/' from program stdin
>> err: External program failed to return required headers Path-Name:
>>
>> I have tried each of these three options individually in spider.pl:
>>     my $bytecount = length pack 'C0a*', $$content;
>>
>>     my $bytecount = length($$content);
>>
>>     use bytes;
>>     $bytecount = length $$content;
>>
>> and get the same result.
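
A small illustration of why the byte count matters for UTF-8 pages. This is a Python sketch rather than Perl, and the sample string is made up. It only shows that the number of characters and the number of encoded bytes differ once non-ASCII text is involved, which is presumably what the three `$bytecount` variants above are trying to get right.

```python
# Hypothetical sample: a short Russian word, not taken from the real page.
text = "Привет"

char_count = len(text)                  # number of characters (code points)
byte_count = len(text.encode("utf-8"))  # number of bytes actually written out

print("characters:", char_count)
print("bytes:     ", byte_count)
```

Each Cyrillic letter here takes two bytes in UTF-8, so a length header computed from characters would be too short for a page like this.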
>>
>> If I look at the output.txt file, I can see that in some of the entries
>> "Path-Name" is not on a line of its own, but instead sits right after
>> the closing </html> tag of the previous entry.
>>
>> e.g.
>>
>> <!-- InstanceEnd -->
>> </html>Path-Name: http://www.site.com/index.htm
>>
>> and not like
>>
>> <!-- InstanceEnd -->
>> </html>
>> Path-Name: http://www.site.com/index.htm
>>
>> Is it doing this because some of the pages don't end with a newline,
>> or does this have to do with page encoding or the multi-byte issue
>> I've seen mentioned?
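
One way to narrow this down is to walk the spider output the way the indexer presumably does and see where the headers stop lining up. The sketch below is in Python, and `find_bad_record` is a made-up name. It assumes each record consists of header lines, a blank line, and then exactly Content-Length bytes of content, which is my understanding of the `-S prog` input format. It checks only the Path-Name and Content-Length headers.

```python
from typing import Optional


def find_bad_record(stream: bytes) -> Optional[str]:
    """Return None if every record lines up, otherwise the Path-Name of the
    record whose Content-Length looks wrong ("" if the very first header
    block is already malformed)."""
    pos = 0
    last_path = ""  # Path-Name of the last record that parsed cleanly
    while pos < len(stream):
        header_end = stream.find(b"\n\n", pos)
        if header_end == -1:
            return last_path
        headers = {}
        for line in stream[pos:header_end].split(b"\n"):
            name, sep, value = line.partition(b":")
            if not sep:  # a line without a colon cannot be a header
                return last_path
            headers[name.strip().lower()] = value.strip()
        if b"path-name" not in headers or b"content-length" not in headers:
            return last_path
        try:
            length = int(headers[b"content-length"])
        except ValueError:
            length = -1
        if length < 0:
            return last_path
        path = headers[b"path-name"].decode("utf-8", "replace")
        pos = header_end + 2 + length  # skip the blank line and the content
        if pos > len(stream):  # declared length runs past the end of the data
            return path
        last_path = path
    return None
```

With the file from this thread, the call would be something like `find_bad_record(open("output.txt", "rb").read())`. If the headers are being counted correctly, a "Path-Name" sitting directly after `</html>` should be harmless under these assumptions, since the content simply had no trailing newline. A non-None result points at the page whose length header disagrees with its content.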
>>
>> As nothing has been changed on the server, it must be an issue with some
>> of the web pages?
>>
>> Am stuck - please help. Thanks
>>
>> _______________________________________________
>> Users mailing list
>> Users(at)not-real.lists.swish-e.org
>> http://lists.swish-e.org/listinfo/users
Received on Mon Oct 10 2011 - 12:56:12 GMT