Also, to answer your question (I missed this earlier):
> Just to narrow things down, if you save the output from spider.pl to a file
> does it contain the header to set the parser type? That is, is spider.pl
> adding a
>
> Document-Type:
>
> header?
Yes. When the system is indexing INCORRECTLY, the Document-Type header is set
on the documents that are FILTERED: two MSWord docs and one Excel file.

Path-Name: http://www.intellivence.com/downloads/example.doc
Content-Length: 105
Last-Mtime: 1051432467
Document-Type: TXT*

Path-Name: http://www.intellivence.com/downloads/example.xls
Content-Length: 505
Last-Mtime: 1051432469
Document-Type: HTML*

Path-Name: http://www.intellivence.com/downloads/bigdoc.doc
Content-Length: 105
Last-Mtime: 1051697435
Document-Type: TXT*

The header is not set on any other documents, including the HTML files.
When spider.pl is run with the two base_url entries in the reverse order,
Document-Type is never set on any document.
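
To compare the two runs side by side, the saved spider.pl output can be scanned for the Document-Type header on each document. Below is a minimal sketch in Python, not part of swish-e or spider.pl. It assumes each record in the saved file is a block of "Name: value" header lines like the ones shown above, followed by a blank line and then exactly Content-Length bytes of document body. The function name list_document_types is made up for this example.

```python
import io


def list_document_types(stream):
    """Yield (path_name, document_type) for each record in a binary stream.

    Assumed record layout: header lines, one blank line, then exactly
    Content-Length bytes of body. document_type is None when the record
    has no Document-Type header.
    """
    while True:
        line = stream.readline()
        # Skip any stray blank lines between records.
        while line in (b"\n", b"\r\n"):
            line = stream.readline()
        if not line:
            return  # end of file

        headers = {}
        while line and line.strip():
            # Split on the first colon only, so URLs in the value survive.
            name, _, value = line.decode("latin-1").partition(":")
            headers[name.strip().lower()] = value.strip()
            line = stream.readline()

        # Skip the document body so the next read starts at the next header.
        stream.read(int(headers.get("content-length", 0)))
        yield headers.get("path-name"), headers.get("document-type")
```

It could be used along these lines, where spider.out is a hypothetical name for the saved output: open the file with open("spider.out", "rb"), loop over list_document_types(f), and print each pair. Documents that print None for the type are the ones for which swish-e has to pick the parser itself.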
Robert Keith
> -----Original Message-----
> From: Bill Moseley [mailto:moseley@hank.org]
> Sent: Tuesday, April 29, 2003 4:14 PM
> To: Robert Keith
> Cc: swish-e@sunsite.berkeley.edu
> Subject: Re: [SWISH-E] Problem on Parser with TXT/HTML and Spider.pl
>
>
> On Tue Apr 29, 2003 at 03:30:23PM -0700, Robert Keith wrote:
> >
> > I am having a strange problem indexing a combination of MSWord, .txt and PHP
> > documents using spider.pl and feeding this into swish-e. If I index the PHP
> > urls first, the documents are parsed and loaded as HTML. If I select the
> > MSWord and other documents, which are filtered by the spider.pl filter
> > routines, the MSWord and other documents are parsed as TXT (correctly), then
> > when the subsequent PHP and HTML documents are parsed, they are parsed as
> > TXT. The SwishSpiderConfig.pl file contains two entries, the URL with the
> > MSWord links, and the URL with only PHP links.
>
> Just to narrow things down, if you save the output from spider.pl to a file
> does it contain the header to set the parser type? That is, is spider.pl
> adding a
>
> Document-Type:
>
> header? I think that code is new, so I'm not sure what you are using. And if
> so can you check between the two indexing methods if they are set incorrectly?
>
> You can also turn on DEBUG_HEADERS ( debug => DEBUG_HEADERS ) in the config
> and watch what content-type is being returned.
>
> If it's not setting that header then we need to look at how swish is
> selecting the parser (which is based on extension as set by IndexContents and
> DefaultContents).
>
>
Received on Wed Apr 30 03:28:26 2003