Re: [swish-e] Unknown header line

From: Erik Guss
Date: Fri, 07 Oct 2011 08:01:34 -0600
I have run into similar issues caused by the encoding of characters in
that document or the previous one (throwing the content length off? I
don't fully understand it). For us, it centers on a handful of
characters that content editors cut and pasted into the web pages.
They are:

right single quote used as an apostrophe (u2019), left double quote
(u201c), right double quote (u201d), em dash (u2014), en dash (u2013),
horizontal ellipsis (u2026).
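
One plausible reading of the "content length" problem, assuming the
pages are UTF-8: each of these characters counts as one character but
takes three bytes, so a Content-Length computed in characters comes up
short and a reader that trusts it stops before the end of the document.
A minimal Python sketch of that gap (the sample body is made up, and
the helper name is only for illustration):

```python
# The six characters named above, keyed by code point.
SUSPECTS = {
    "\u2019": "right single quote",
    "\u201c": "left double quote",
    "\u201d": "right double quote",
    "\u2014": "em dash",
    "\u2013": "en dash",
    "\u2026": "horizontal ellipsis",
}


def length_gap(text):
    """Return (character count, UTF-8 byte count) for text."""
    return len(text), len(text.encode("utf-8"))


for ch, name in SUSPECTS.items():
    chars, nbytes = length_gap(ch)
    print(f"U+{ord(ch):04X} {name}: {chars} character, {nbytes} bytes")

# Hypothetical page body containing three of the characters.
body = "<p>Don\u2019t miss the \u201cnew\u201d page</p>\n"
chars, nbytes = length_gap(body)
raw = body.encode("utf-8")

# A reader that consumes only `chars` bytes leaves this tail unread,
# so the tail would be parsed as if it were the next header line.
leftover = raw[chars:]
print(f"{chars} characters, {nbytes} bytes, unread tail: {leftover!r}")
```

If that reading is right, it would also explain why the next Path-Name
header ends up glued to leftover text from the previous document.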

We've used several ways to find these and replace them with their
Latin-1 equivalents. You can find them by turning the verbosity of your
indexing up to 3 and seeing which document indexing stops at, or by
using grep in a shell script, something like:
    find $dir -name '*.html' -exec grep -P '\xE2\x80\x9D' {} /dev/null ";" \
        -print >> /src/utf8/u201d.txt
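
As an alternative to running one grep per character, here is a rough
Python sketch that checks for all six byte sequences in one pass. It
assumes the pages are UTF-8 and end in .html; the function name and the
labels are my own, not anything from swish-e:

```python
from pathlib import Path

# UTF-8 byte sequences for the characters named in this message.
PATTERNS = {
    "u2019": "\u2019".encode("utf-8"),
    "u201c": "\u201c".encode("utf-8"),
    "u201d": "\u201d".encode("utf-8"),
    "u2014": "\u2014".encode("utf-8"),
    "u2013": "\u2013".encode("utf-8"),
    "u2026": "\u2026".encode("utf-8"),
}


def scan(root):
    """Yield (path, line number, labels) for each matching line.

    Only *.html files under root are read, and they are read as raw
    bytes so that files with mixed or broken encodings do not raise.
    """
    for path in sorted(Path(root).rglob("*.html")):
        with open(path, "rb") as fh:
            for lineno, line in enumerate(fh, start=1):
                hits = [label for label, seq in PATTERNS.items() if seq in line]
                if hits:
                    yield path, lineno, hits


# Example use (the directory is a placeholder):
#     for path, lineno, hits in scan("/path/to/site"):
#         print(f"{path}:{lineno}: {', '.join(hits)}")
```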

Then use Notepad++ or another editor that supports UTF-8 to find it in
the file and replace it.

It is a somewhat manual process, but it is the only way I've found to
clean them up.
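
If a scripted pass is acceptable, the replacement step could be
sketched in Python along these lines. The ASCII substitutions chosen
here ("--" for an em dash, "..." for an ellipsis, and so on) are my
assumption of reasonable equivalents, the function names are
illustrative, and the files are assumed to be valid UTF-8. It rewrites
files in place, so try it on copies first:

```python
# Assumed ASCII stand-ins for the six characters; adjust to taste.
REPLACEMENTS = {
    "\u2019": "'",
    "\u201c": '"',
    "\u201d": '"',
    "\u2014": "--",
    "\u2013": "-",
    "\u2026": "...",
}
TABLE = str.maketrans(REPLACEMENTS)


def to_ascii_punctuation(text):
    """Return text with the six characters swapped for ASCII."""
    return text.translate(TABLE)


def clean_file(path, encoding="utf-8"):
    """Rewrite path in place; return True if anything changed.

    newline="" keeps the file's existing line endings untouched.
    """
    with open(path, encoding=encoding, newline="") as fh:
        original = fh.read()
    cleaned = to_ascii_punctuation(original)
    if cleaned != original:
        with open(path, "w", encoding=encoding, newline="") as fh:
            fh.write(cleaned)
    return cleaned != original
```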

I hope it helps.

Erik Guss
Montana State University Library

On Fri, 2011-10-07 at 15:30 +0200, Clint wrote:
> Hi,
> 
> Swish-e no longer wants to update the index after having run spider.pl.
> It ran perfectly for more than two years now, but has started to abort
> and spew out an error I initially had when I started using it.
> 
> On Linux.
> 
> I first run on command line:
> /usr/local/lib/swish-e/spider.pl > output.txt
> 
> and then
> swish-e -c swish.conf -S prog -i stdin < output.txt
> 
> but this aborts after a while with:
> 
> Warning: Unknown header line: 'om/linking/' from program stdin
> err: External program failed to return required headers Path-Name:
> 
> I have tried each of these 3 options individually in spider.pl:
>     my $bytecount = length pack 'C0a*', $$content;
> 
>     my $bytecount = length($$content);
> 
>     use bytes;
>     $bytecount = length $$content;
> 
> and get the same result.
> 
> If I look at the output.txt file, I can see that in some of the
> entries "Path-Name" is not on a line of its own, but instead sits
> right after the closing </html> tag of the previous entry.
> 
> e.g.
> 
> <!-- InstanceEnd -->
> </html>Path-Name: http://www.site.com/index.htm
> 
> and not like
> 
> <!-- InstanceEnd -->
> </html>
> Path-Name: http://www.site.com/index.htm
> 
> Is it doing this because some of the pages don't end with a newline,
> or does it have to do with page encoding or the multi-byte issue I've
> seen mentioned?
> 
> As nothing has been changed on the server, it must be an issue with some
> of the web pages?
> 
> Am stuck - please help. Thanks
> 
> _______________________________________________
> Users mailing list
> Users(at)not-real.lists.swish-e.org
> http://lists.swish-e.org/listinfo/users

Received on Fri Oct 07 2011 - 14:01:35 GMT