RE: Problem on Parser with TXT/HTML and Spider.pl

From: Robert Keith <Robert(at)not-real.Technolords.com>
Date: Wed Apr 30 2003 - 03:21:19 GMT
Also, to answer your question (I missed this):

> Just to narrow things down, if you save the output from spider.pl to a
> file does it contain the header to set the parser type?  That is, is
> spider.pl adding a
>
>    Document-Type:
>
> header?

When the system is indexing INCORRECTLY, the Document-Type header is being
set on the documents that are being FILTERED: two MSWord docs and one Excel
file.

Path-Name: http://www.intellivence.com/downloads/example.doc
Content-Length: 105
Last-Mtime: 1051432467
Document-Type: TXT*

Path-Name: http://www.intellivence.com/downloads/example.xls
Content-Length: 505
Last-Mtime: 1051432469
Document-Type: HTML*

Path-Name: http://www.intellivence.com/downloads/bigdoc.doc
Content-Length: 105
Last-Mtime: 1051697435
Document-Type: TXT*

But not on any other documents or HTML files.
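
For anyone who wants to repeat this check on their own saved output, here is a minimal Python sketch. It is not part of spider.pl or swish-e. It assumes the saved stream uses the usual layout: a block of header lines, a blank line, then exactly Content-Length bytes of document content. The function name scan_prog_output and the sample URLs are made up for illustration.

```python
import io


def scan_prog_output(stream):
    """Yield (path_name, document_type) for each document in a saved
    spider.pl output stream opened in binary mode.

    Assumed layout per document: header lines, one blank line, then
    exactly Content-Length bytes of content.
    """
    while True:
        line = stream.readline()
        # Tolerate stray blank lines between documents.
        while line and not line.strip():
            line = stream.readline()
        if not line:
            return  # end of stream

        headers = {}
        while line and line.strip():
            # Split on the first colon only, so URLs in values survive.
            name, _, value = line.decode("latin-1").partition(":")
            headers[name.strip().lower()] = value.strip()
            line = stream.readline()

        # Skip over the document body so the next read starts at a header.
        stream.read(int(headers.get("content-length", 0)))
        yield headers.get("path-name"), headers.get("document-type")


# Hypothetical sample: one filtered document with a Document-Type header,
# followed by one document without it.
sample = (
    b"Path-Name: http://example.invalid/a.doc\n"
    b"Content-Length: 5\n"
    b"Document-Type: TXT*\n"
    b"\n"
    b"hello"
    b"Path-Name: http://example.invalid/b.php\n"
    b"Content-Length: 3\n"
    b"\n"
    b"abc"
)

results = list(scan_prog_output(io.BytesIO(sample)))
for path, doc_type in results:
    print(f"{path}: {doc_type or '(no Document-Type)'}")
```

To run it against real output, replace the io.BytesIO(sample) argument with a file opened in "rb" mode. Comparing the listing from both indexing orders should show exactly which documents carry the header.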

When spider.pl is run with the two base_url entries in reverse order,
Document-Type is never set.

Robert Keith



> -----Original Message-----
> From: Bill Moseley [mailto:moseley@hank.org]
> Sent: Tuesday, April 29, 2003 4:14 PM
> To: Robert Keith
> Cc: swish-e@sunsite.berkeley.edu
> Subject: Re: [SWISH-E] Problem on Parser with TXT/HTML and Spider.pl
>
>
> On Tue Apr 29, 2003 at 03:30:23PM -0700, Robert Keith wrote:
> >
> > I am having a strange problem indexing a combination of MSWord, .txt and PHP
> > documents using spider.pl and feeding this into swish-e.  If I index the PHP
> > urls first, the documents are parsed and loaded as HTML.  If I select the
> > MSWord and other documents, which are filtered by the spider.pl filter
> > routines, the MSWord and other documents are parsed as TXT (correctly), then
> > when the subsequent PHP and HTML documents are parsed, they are parsed as
> > TXT.  The SwishSpiderConfig.pl file contains two entries, the URL with the
> > MSWord links, and the URL with only PHP links.
>
> Just to narrow things down, if you save the output from spider.pl to a file
> does it contain the header to set the parser type?  That is, is spider.pl
> adding a
>
>    Document-Type:
>
> header?  I think that code is new, so I'm not sure what you are using.  And
> if so can you check between the two indexing methods if they are set
> incorrectly?
>
> You can also turn on DEBUG_HEADERS ( debug => DEBUG_HEADERS ) in the config
> and watch what content-type is being returned.
>
> If it's not setting that header then we need to look at how swish is
> selecting the parser (which is based on extension as set by IndexContents
> and DefaultContents).
>
>
Received on Wed Apr 30 03:28:26 2003