Re:

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Nov 25 2003 - 22:23:54 GMT
On Tue, Nov 25, 2003 at 01:45:05PM -0800, Kissman, Paul (BLC) wrote:
> I am using the Swish-E (version 2.4.0) prog method along with the
> supplied spider script and the Filters Option to index both html and pdf
> files (and Word files) on our web site. I've got my SwishSpiderConfig
> file working; everything is fine in all regards but one.  Most of my
> website's html pages use server side includes, and I am not getting any
> lastmodified date information for these shtml files.

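 http://httpd.apache.org/docs/howto/ssi.html

Look for XBitHack.  With "XBitHack full", Apache sets Last-Modified to the
file's own modification time for parsed files that have the group-execute
bit set.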


> After some digging around I find out that the LWP package can't find the
> file modification date because the full page is generated dynamically
> through http and the filesystem modification timestamp for the main part
> of the page is not available to it.

Kind of.  See above.

> I was thinking that one could insert a function in spider.pl that would
> quickly map the URL to the actual file, then go out and grab the actual
> file's timestamp if the web page were on the local server, and then
> stuff it in as the "Last-Mtime" value in the $headers string that gets
> returned to the indexer.

Sure, if your httpd.conf is simply a one-to-one mapping of URLs to files
under DocumentRoot.
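
A minimal sketch of that mapping, assuming every URL under one base URL
corresponds directly to a file under DocumentRoot (no Alias, rewrite, or
virtual-host rules).  The helper name local_mtime_for_url, its arguments,
and the "index.html" directory default are hypothetical; none of this is
part of spider.pl.

```perl
use strict;
use warnings;

# Hypothetical helper: map a URL under $base_url (which should end in "/")
# to a file under $doc_root and return the file's mtime in epoch seconds.
# Returns undef when the URL is outside $base_url or the file is missing.
sub local_mtime_for_url {
    my ($url, $base_url, $doc_root) = @_;

    # Only handle URLs that live under the base URL we know how to map.
    return unless index($url, $base_url) == 0;

    my $path = substr($url, length $base_url);
    $path =~ s/[?#].*//s;                          # drop query and fragment
    $path =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;    # decode %XX escapes
    return if $path =~ m{(?:^|/)\.\.(?:/|$)};      # refuse ".." segments

    $path =~ s{^/+}{};
    my $file = "$doc_root/$path";
    $file .= 'index.html' if $file =~ m{/$};       # assumed DirectoryIndex

    my @st = stat $file or return;                 # empty list if stat fails
    return $st[9];                                 # element 9 is the mtime
}
```

One place to call this might be a test_response or filter_content callback
in SwishSpiderConfig, storing the result with $response->last_modified($mtime)
when the server sent no Last-Modified header.  That assumes spider.pl builds
its Last-Mtime line from that header, so check your copy of spider.pl first.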

> Is this a reasonable approach? Has anyone done this or solved this
> problem a different way?

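Well, for fun, you could try this, but I would not recommend it: LWP then
reads the files straight from disk, so you get modification times but the
includes are never expanded.  Say your DocumentRoot is /var/www: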

  spider.pl default file:///var/www/index.html

-- 
Bill Moseley
moseley@hank.org
Received on Tue Nov 25 22:23:59 2003