This may sound like a stupid reply if you have already checked this, but I was
experiencing a similar problem and found the solution in the list archives.
A compile-time option must be set to at least the depth of your <TITLE> tag,
that is, how many lines into the file it appears.
In config.h:
#define TITLETOPLINES 30
/* This is how many lines deep SWISH will look into an HTML file to
** attempt to find a <TITLE> tag.
*/
The default is something like 7. Raising it may slow indexing down a little,
but 30 seems to be a good number.
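
To show why the limit matters, here is a minimal sketch of a line-limited
<TITLE> search. It is not the actual SWISH code: title_line() and
find_nocase() are hypothetical helpers made up for illustration, and only
the TITLETOPLINES name comes from config.h.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define TITLETOPLINES 30  /* same name as in config.h; 30 is the value suggested above */

/* Case-insensitive substring search; returns a pointer to the match or NULL. */
static const char *find_nocase(const char *hay, const char *needle)
{
    size_t n = strlen(needle);
    for (; *hay; hay++) {
        size_t i = 0;
        while (i < n && hay[i] &&
               tolower((unsigned char)hay[i]) == tolower((unsigned char)needle[i]))
            i++;
        if (i == n)
            return hay;
    }
    return NULL;
}

/* Hypothetical helper: returns the 1-based number of the first line that
** contains a <TITLE> tag, looking at no more than maxlines lines.
** Returns 0 if no tag is found within that limit.
*/
static int title_line(const char *text, int maxlines)
{
    const char *p = text;
    int line = 1;

    assert(text != NULL);
    while (*p && line <= maxlines) {
        const char *eol = strchr(p, '\n');
        size_t len = eol ? (size_t)(eol - p) : strlen(p);
        char buf[1024];

        if (len >= sizeof buf)          /* truncate over-long lines */
            len = sizeof buf - 1;
        memcpy(buf, p, len);
        buf[len] = '\0';

        if (find_nocase(buf, "<title>"))
            return line;
        if (!eol)
            break;
        p = eol + 1;
        line++;
    }
    return 0;
}
```

With a document whose <TITLE> sits on line 10, title_line(doc, 7) finds
nothing while title_line(doc, TITLETOPLINES) finds it, which matches the
symptom of a title falling back to the file name.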
----
John Leth-Nissen
Web Developer
Gulfstream Aerospace Corporation
David Norris wrote:
> OK, I think this makes some sense.
>
> If I index http://www.misma.org/contact.html using the spider the TITLE is
> set to "contact.html" in the swish index file.
> HTTP Headers:
> HTTP/1.1 200 OK
> Date: Mon, 31 May 1999 10:22:55 GMT
> Server: Apache/1.2.5
> X-Server-CGI: PHP/3.0.7
> X-Resource-Indicator:
> X-Resource-Modified: 923650015
> Expires: Tue, 01 Jun 1999 10:22:55 GMT
> Cache-Control: post-check=43200,pre-check=86400
> Last-Modified: 1999-04-09T09:26:55Z
> Connection: close
> Content-Type: text/html; charset=iso-8859-1
>
> If I index http://localhost/test/contact.html using the spider the TITLE is
> set to "Contacts - MiSMA..."
> HTTP Headers:
> HTTP/1.1 200 OK
> Date: Mon, 31 May 1999 10:21:54 GMT
> Server: Apache/1.3.6 (Win32)
> Parser: PHP/3.0.6 (Win32)
> Connection: close
> Content-Type: text/html
>
> If I index /my_documents/test/contact.html using file system the TITLE is
> set to "Contacts - MiSMA..."
> No HTTP Header Equivalents.
>
> This is exactly the same file in all three cases. Line feed is Unix LF in
> all three cases. I sorta hacked my copy of the swishspider to force it to
> index text/html; charset=iso-8859-1. That appears to be the only major
> difference which could have an effect on the parsing. Something, somewhere
> doesn't recognize that it should be parsing that document with the HTML
> parser. There is some other code somewhere that assumes anything not
> exactly text/html isn't HTML. Forcing the spider to index the contents of
> text/html; charset=... isn't enough.
>
> So, to test this theory I changed my content-type header on the misma.org
> server. Sure enough, the titles are now indexed correctly. So, this
> appears to be the Content-Type 'feature' of that old Perl module.
>
> I don't know if this helps anyone else. But, I can, at least, hack
> something to change my content-type header when swishspider visits a
> document until someone figures this out.
>
> ,David Norris
>
> World Wide Web - http://www.geocities.com/CapeCanaveral/Lab/1652/
> Home Computer - http://illusionary.tzo.cc/
> Page via mail - 412039@pager.mirabilis.com
> ICQ Universal Internet Number - 412039
> E-Mail - kg9ae@geocities.com
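
On the Content-Type theory above: if some code compares the whole header
value against "text/html" exactly, a value with a charset parameter will
not match. Below is a minimal sketch of a comparison that ignores
parameters and case. It is only an illustration in C, not the swishspider
code (which is Perl), and is_html_content_type() is a hypothetical name.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: returns 1 if a Content-Type header value names the
** media type text/html, ignoring case and any parameters such as charset.
** Returns 0 otherwise.
*/
static int is_html_content_type(const char *value)
{
    static const char want[] = "text/html";
    size_t n = sizeof want - 1;
    size_t i;

    assert(value != NULL);

    /* Skip leading whitespace in the header value. */
    while (*value == ' ' || *value == '\t')
        value++;

    /* Compare the media type itself, case-insensitively.  A '\0' in value
    ** never equals a character of want, so this cannot read past the end.
    */
    for (i = 0; i < n; i++) {
        if (tolower((unsigned char)value[i]) != want[i])
            return 0;
    }

    /* The media type must end here: end of string or start of parameters. */
    value += n;
    while (*value == ' ' || *value == '\t')
        value++;
    return *value == '\0' || *value == ';';
}
```

Both header values quoted above, "text/html" and
"text/html; charset=iso-8859-1", pass this check, whereas a plain strcmp()
against "text/html" accepts only the first.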
Received on Mon May 31 07:04:03 1999