Re: error indexing pdf files

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Apr 15 2003 - 20:14:42 GMT
On Tue, 15 Apr 2003, Jody Cleveland wrote:

> I spoke with the person in charge of the site, and she wants a separate
> search page for each directory.

Sounds like she already knows where everything is.  Someone might have seen
a page on the site but cannot remember where -- so it's nice to have a
top-level search.  Look at perl.apache.org.  I think it's too detailed, but
by default it searches the current section, and you can override that.

Again, I'd probably index the site all at once, then use metanames based
on the path to limit to areas of the site.

> Here's the main chunk of my SwishSpider.pl:
> @servers = (
> 
>         max_depth       => 10,         # spider only ten levels deep

BTW - That's ten levels of recursion (links followed from the starting
page), IIRC.  It doesn't really mean /1/2/3/4/5/6/7/8/9/10 path segments.

> As it is now, it searches all of www.oshkoshpubliclibrary.org. Now, I've got
> /citydirs/ in the test_url part. For the base_url, should that stop at .org,
> or should it also contain /citydirs/? And, am I missing something else, or
> have a typo? Either way I do it, it still indexes everything at the site.

I don't see that.  With /citydirs/ in test_url, links outside that
directory get skipped:

 -- Starting to spider: http://www.oshkoshpubliclibrary.org/citydirs/ --
>> +Fetched 0 Cnt: 1 http://www.oshkoshpubliclibrary.org/citydirs/ 200 OK text/html 14165 parent:
-Skipped http://www.oshkoshpubliclibrary.org/Welcome.html due to 'test_url' user supplied function #1
-Skipped http://www.oshkoshpubliclibrary.org/internetguides/1_databases.html due to 'test_url' user supplied function #1
.
>> -Failed 1 Cnt: 2 http://www.oshkoshpubliclibrary.org/citydirs/services_index.html 404 Object Not Found text/html 4040 parent:http://www.oshkoshpubliclibrary.org/citydirs/
sleeping 5 seconds
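
For reference, a config along these lines would be consistent with output
like the above.  This is only a sketch, not your actual file: it assumes
base_url includes /citydirs/ so the spider starts there, the email value is
a placeholder, and the test_url callback is assumed to receive a URI object
as its first argument.

```perl
# SwishSpider.pl -- illustrative sketch; values marked "placeholder" are assumptions
@servers = (
    {
        # Start inside the directory you want indexed.
        base_url  => 'http://www.oshkoshpubliclibrary.org/citydirs/',
        email     => 'webmaster@example.invalid',    # placeholder address
        max_depth => 10,    # levels of link recursion, not path segments

        # Return true to follow a link, false to skip it.
        # Anything whose path is outside /citydirs/ is skipped.
        test_url  => sub {
            my ( $uri, $server ) = @_;
            return $uri->path =~ m{^/citydirs/};
        },
    },
);

1;    # conventional true value at the end of a file loaded by another script
```

With a test like this, a link to /Welcome.html would be reported as skipped
by 'test_url', while /citydirs/services_index.html would still be requested.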

It's interesting that IIS closes the keep alive connection on a 404 error.


-- 
Bill Moseley moseley@hank.org
Received on Tue Apr 15 20:18:26 2003