Re: Trying to index Lotus Notes Domino server

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Feb 04 2003 - 18:54:51 GMT

On Tue, 4 Feb 2003, Krueger, Tom wrote:

> Thanks Bill and OK.
> I am now using spider.pl and I did this test with this outcome.
> 
> [admin@swish swishindexes]$ spider.pl default
> http://username:password@hostname.xiotech.com/default.nsf | swish-e -S prog
> -i stdin
> Indexing Data Source: "External-Program"
> Indexing "stdin"
> /usr/local/bin/spider.pl: Reading parameters from 'default'

> http://hostname.xiotech.com/global2.nsf/web%20search%20simple?OpenForm  !=
> (text/html text/plain)

When you use the "default" settings it reports pages that are not
text/html or text/plain content types.  (You can see the "default" settings
by looking at spider.pl and searching for "sub default_urls".)

Is that the complete error?  The line that generates that error should be:

 print STDERR "$_[0] $content_type != (@content_types)\n";

so if that's the case it seems like a very strange content type that your
server is returning, or perhaps the server is not returning any content
type.  What web server are you running?

You can use the HEAD command that comes with Perl's LWP (if installed) to
see what the content type is, or just use telnet.
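
If neither is handy, the same check can be sketched in a few lines of
Python (not Perl, and not part of spider.pl).  This is only a rough
illustration: the function names are made up for the example, and the
accepted types are the text/html and text/plain pair from the message
above.

```python
from urllib.request import Request, urlopen

# The two content types the "default" settings accept, per the message above.
ACCEPTED = ("text/html", "text/plain")

def media_type(content_type_header):
    """Strip parameters such as '; charset=...' and normalise case."""
    if not content_type_header:
        return ""
    return content_type_header.split(";", 1)[0].strip().lower()

def is_indexable(content_type_header):
    """True if the header names one of the accepted content types."""
    return media_type(content_type_header) in ACCEPTED

def head_content_type(url, timeout=10):
    """Send a HEAD request and return the raw Content-Type header (or None).

    A 401 from a protected server raises urllib.error.HTTPError here;
    this sketch does not attempt authentication.
    """
    req = Request(url, method="HEAD")
    with urlopen(req, timeout=timeout) as resp:
        return resp.headers.get("Content-Type")
```

Calling head_content_type() on the URL that was rejected and passing the
result to is_indexable() should show whether the server sends an odd
content type or none at all.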


> Warning: Unknown header line: 'HTML>' from program stdin
> 
> Warning: Unknown header line: 'HTML>' from program stdin

That's odd.  That means that the content-length header sent to swish is
probably wrong.

spider.pl is a "-S prog" type of program and it outputs a header that
gives the name of the file, a date (unix timestamp), and the
content-length, then a blank line followed by the content.  Swish uses that
header to know how much to read in (and so where the next file begins).  If
the content-length that spider.pl sets is somehow wrong then you will see
that type of error.
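
To make the format concrete, here is a minimal Python sketch of one such
record.  It is not how spider.pl itself is written (that is Perl).  The
header names Path-Name, Content-Length and Last-Mtime are what I
understand the swish-e "prog" documentation to use, so check them against
your version.  prog_record is a made-up name for the example.

```python
def prog_record(path, body, mtime):
    """Build one '-S prog' style record as bytes.

    Header names are an assumption based on the swish-e prog docs.
    The length must count bytes, not characters, or the reader will
    stop in the wrong place and treat leftover content as a header.
    """
    data = body.encode("utf-8") if isinstance(body, str) else body
    header = (
        f"Path-Name: {path}\n"
        f"Content-Length: {len(data)}\n"
        f"Last-Mtime: {int(mtime)}\n"
        "\n"  # blank line separates the header from the content
    )
    return header.encode("utf-8") + data
```

The point of the sketch is the len(data) call: a length that is off by even
a few bytes leaves the tail of one document where the next header is
expected.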

You can just run the spider like:

  ./spider.pl default http://hostname.xiotech.com/default.nsf > out.log

then you can look at out.log to see exactly what swish-e is attempting to
parse.
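
If out.log is too large to read by eye, a small script can walk it the same
way.  The following is a rough Python sketch under the assumption that each
record is "Name: value" header lines, a blank line, then exactly
Content-Length bytes of content.  check_stream is a hypothetical helper,
not something that ships with swish-e.

```python
import re

NAME_RE = re.compile(rb"[A-Za-z][A-Za-z0-9-]*")

def check_stream(data):
    """Return a list of (byte_offset, message) problems found in the stream."""
    problems = []
    pos = 0
    while pos < len(data):
        headers = {}
        while True:  # read header lines up to the blank line
            end = data.find(b"\n", pos)
            if end == -1:
                problems.append((pos, "header not terminated by a newline"))
                return problems
            line = data[pos:end].rstrip(b"\r")
            if line == b"":
                pos = end + 1
                break
            name, sep, value = line.partition(b":")
            if not sep or not NAME_RE.fullmatch(name):
                # Same situation as swish-e's "Unknown header line" warning.
                problems.append((pos, f"unknown header line: {line[:40]!r}"))
                return problems
            headers[name.lower()] = value.strip()
            pos = end + 1
        length = headers.get(b"content-length", b"")
        if not length.isdigit():
            problems.append((pos, "missing or non-numeric content-length"))
            return problems
        pos += int(length)  # skip the content; a header should come next
        if pos > len(data):
            problems.append((len(data), "content shorter than content-length"))
    return problems
```

Run it as check_stream(open("out.log", "rb").read()).  The first offset it
reports is where the declared length and the real content stop agreeing.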


> Need Authentication for http://inside.xiotech.com/names.nsf at realm '/'
> (<Enter> skips)

That's just telling you that a username and password are needed, although
I don't see why it's accessing a different host name (inside.xiotech.com)
than the one you started with -- the host names should match.


spider.pl has a number of debugging options you can turn on and it will
report the URL it's fetching and the response headers the server returned.
That might be helpful to see what's happening.

-- 
Bill Moseley moseley@hank.org
Received on Tue Feb 4 18:55:21 2003