(no subject)

From: Kissman, Paul (BLC) <Paul.Kissman(at)not-real.state.ma.us>
Date: Tue Nov 25 2003 - 21:45:09 GMT
I am using the Swish-E (version 2.4.0) prog method along with the
supplied spider script and the filters option to index HTML, PDF, and
Word files on our web site. My SwishSpiderConfig file is working, and
everything is fine in all regards but one: most of my website's HTML
pages use server-side includes, and I am not getting any last-modified
date information for these .shtml files.

After some digging around, I found that the LWP package can't find the
file modification date because the full page is generated dynamically
and served over HTTP, so the filesystem modification timestamp for the
main part of the page is not available to it.

I was thinking that one could insert a function in spider.pl that would
quickly map the URL to the actual file, then go out and grab the actual
file's timestamp if the web page were on the local server, and then
stuff it in as the "Last-Mtime" value in the $headers string that gets
returned to the indexer.
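
A rough sketch of the lookup described above, written in Python only to
illustrate the logic (spider.pl is Perl, so a real version would be a
Perl subroutine). The host name, document root, directory index name,
and function names are all made up for illustration and are not part of
spider.pl or Swish-E.

```python
import os
from urllib.parse import urlsplit, unquote

# Hypothetical values: adjust to the real site layout.
BASE_HOST = "www.example.org"
DOC_ROOT = "/var/www/html"


def url_to_local_path(url, base_host=BASE_HOST, doc_root=DOC_ROOT):
    """Return the local file path for url, or None if it is not on this server."""
    parts = urlsplit(url)
    if parts.hostname != base_host:
        return None
    rel = unquote(parts.path).lstrip("/")
    if rel == "" or rel.endswith("/"):
        rel += "index.shtml"  # assumed directory index name
    root = os.path.realpath(doc_root)
    path = os.path.realpath(os.path.join(root, rel))
    # Refuse paths that escape the document root (e.g. via "..").
    if os.path.commonpath([root, path]) != root:
        return None
    return path


def local_mtime(url, **kwargs):
    """Return the file's modification time as an integer epoch, or None."""
    path = url_to_local_path(url, **kwargs)
    if path is None or not os.path.isfile(path):
        return None
    return int(os.path.getmtime(path))
```

When a timestamp comes back, it could be stuffed in as the "Last-Mtime"
value; when the function returns None (remote URL or no matching file),
the spider would carry on exactly as it does now.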

Is this a reasonable approach? Has anyone done this or solved this
problem a different way?

I know I could walk through the filesystem instead of spidering my site,
but I wanted to take advantage of all the work that has been done with
Filters, which seem to be available only when spidering at this point.

Any help would be appreciated.

Paul J. Kissman
Library Information Systems Specialist
Massachusetts Board of Library Commissioners
648 Beacon St.
Boston, MA  02215
paul.kissman@state.ma.us
www.mlin.lib.ma.us or www.mlin.org
617-267-9400 * 800-952-7403 (in-state)
Fax: 617-421-9833

Received on Tue Nov 25 21:56:56 2003