Re: [swish-e] index a list of files

From: Brad Bauer <bbauer(at)not-real.telstate.com>
Date: Wed Jul 09 2008 - 13:47:13 GMT
Bill,

Thanks for your reply.

Yes, much of the content is dynamically generated.  Additionally, we've been
using an increasing number of includes on the static pages, which is leaving
more and more of the site unindexed.  And as mentioned previously, we are
finding that some PDFs with no links to them anywhere on the site are getting
indexed.  By spidering we hope to resolve most of these issues.

Perhaps there is something else at play slowing it down.  While trying to
get the spider working I reduced the SwishSpiderConfig.pl settings to a bare
minimum, so all timing settings are at their defaults.  What are the default
timings the spider uses?  Can you recommend good values for the timing-related
settings?

I'll look into modifying spider.pl, but I am no Perl guru so I might take an
easier route: I am thinking I can just adjust the test_url callback in
SwishSpiderConfig.pl to append each .pdf URL it encounters to a log file and
return false for that URL, so the spider skips it.  Then I will probably
modify file.pl (since it is such a simple file) to index the PDFs listed in
the log file.  Do you see any potential issues with that?
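
To make the second half of that plan concrete, here is a rough sketch of a feeder program for the -S prog method. It is in Python rather than a modified file.pl, and it is an illustration, not tested against a real index. It reads the logged URLs, maps each one to a file under the document root, and writes a record with Path-Name, Content-Length and Last-Mtime headers, a blank line, and then the file content. DOC_ROOT, the log file name, the script name and the helper functions are all made up for the example. It also assumes that a PDF filter is already configured on the swish-e side (as in the existing fs setup), since the raw PDF bytes are passed through unconverted.

```python
#!/usr/bin/env python3
"""Sketch: feed logged PDF URLs to swish-e through the -S prog input method."""
import os
import sys
from urllib.parse import urlparse, unquote

DOC_ROOT = "/var/www/html"  # assumption: the web server's document root


def url_to_local_path(url, doc_root=DOC_ROOT):
    """Map the path component of a URL onto the document root."""
    rel = unquote(urlparse(url).path).lstrip("/")
    return os.path.join(doc_root, rel)


def format_record(path_name, content, mtime=None):
    """Build one document record: headers, a blank line, then the raw bytes."""
    headers = [
        f"Path-Name: {path_name}",
        f"Content-Length: {len(content)}",  # length in bytes
    ]
    if mtime is not None:
        headers.append(f"Last-Mtime: {int(mtime)}")
    return ("\n".join(headers) + "\n\n").encode("utf-8") + content


def main(log_file, out=None):
    """Emit one record per unique URL listed in log_file (one URL per line)."""
    out = out or sys.stdout.buffer
    seen = set()
    with open(log_file) as fh:
        for line in fh:
            url = line.strip()
            if not url or url in seen:
                continue
            seen.add(url)
            local = url_to_local_path(url)
            try:
                with open(local, "rb") as pdf:
                    data = pdf.read()
                mtime = os.path.getmtime(local)
            except OSError as err:
                # Report the problem on stderr and keep going.
                print(f"skipping {url}: {err}", file=sys.stderr)
                continue
            # Keep the URL as the path name so search results link to the site.
            out.write(format_record(url, data, mtime))


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Something like "IndexDir ./feed_pdfs.py" plus "SwishProgParameters pdf_urls.log" in the config should run it, if I am reading the docs right.  De-duplicating the URLs matters because the spider may hit the same PDF link from many pages.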


B Bauer

-----Original Message-----
From: users-bounces@lists.swish-e.org
[mailto:users-bounces@lists.swish-e.org] On Behalf Of Bill Moseley
Sent: Wednesday, July 09, 2008 1:01 AM
To: Swish-e Users Discussion List
Subject: Re: [swish-e] index a list of files

On Tue, Jul 08, 2008 at 10:34:29PM -0400, Brad Bauer wrote:
> 
> RE: Caching - I am attempting to avoid downloading pdfs since it is 
> very time-consuming compared to the fs method. (They do, after all, 
> already exist on the server)  Using the spider is taking 20+ minutes 
> for only a small section of the site, whereas using the fs setup I am 
> able to index the entire server in about 5 minutes.

The web server is running on the same machine?  It seems hard to believe
that the web server would be so much slower at fetching the files that it
makes a difference.  I'd think most of the time (for either mode) would be
spent extracting the pdf and indexing.  Fetching over http vs. the file
system would seem like background noise.

But, I'm just guessing.

Nice thing about the -S prog method is you can, well, you can write a
program to do your indexing.  So you could use the spider and when the
spider returns a link to a pdf you could abort the fetch and grab the
content from the local disk.  Might take a bit of tweaking of the spider,
but it's very possible and not too hard.

Is your content dynamically generated?  Is that why you are spidering
instead of "spidering" the file system?

--
Bill Moseley
moseley@hank.org

Unsubscribe from or help with the swish-e list: 
   http://swish-e.org/Discussion/

Help with Swish-e:
   http://swish-e.org/current/docs

_______________________________________________
Users mailing list
Users@lists.swish-e.org
http://lists.swish-e.org/listinfo/users

Received on Wed Jul 9 09:47:17 2008