
Re: spider a database

From: Dave Hane <dhane(at)not-real.pw.usda.gov>
Date: Sat Nov 05 2005 - 15:44:49 GMT
On Friday 04 November 2005 21:07, Michael Porcaro wrote:
> When I use this command to spider my site,
>
> Swish-e -S http -I http://www.youngcomposers.com
>
> It takes awhile to spider.  I think I would have to wait about a month
> for it to finish everything at that rate.  It seems to print a neater
> temp file though, but there seems to be no way to configure this
> (example, can't seem to use a swish.conf file)
>
> Yet, when I use this command
>
> Swish-e -S prog -c swish.conf
>
> Where swish.conf equals:
>
>     IndexDir spider.pl
>     IndexOnly .html
>     SwishProgParameters default http://www.youngcomposers.com
>     Metanames swishtitle swishdocpath
>     StoreDescription TXT* 10000
>     StoreDescription HTML* <body> 10000
>     FuzzyIndexingMode Stemming_en
>
> I can configure it, but it seems to print out garbage in the temp files,
> and the temp files seem to blow up.  It also seems to take awhile to
> index.
>
> Now you mentioned that swish-e -S http -I http://www.mysite.com is
> deprecated, but it is better to use than the following method.  I am
> not quite sure I follow.  What is the common way to spider a site?  I'm
> confused which method to use.  By the way, I was confused when I said I
> wanted to spider a database.  Both the methods I mention seem to spider
> my whole site.
>
> How long does it typically take to spider a site that has about 90,000
> pages?
>
> -----Original Message-----
> From: swish-e@sunsite3.berkeley.edu
> [mailto:swish-e@sunsite3.berkeley.edu] On Behalf Of Bill Moseley
> Sent: Friday, November 04, 2005 3:28 PM
> To: Multiple recipients of list
> Subject: [SWISH-E] Re: spider a database
>
> On Fri, Nov 04, 2005 at 03:16:27PM -0500, Michael Porcaro wrote:
> > Please bear with me here and thank you for your patience.  I looked at
> > your link and searched around.  By searching, I assume that swish-e can
> > spider databases, I wasn't really sure about this before.  I came across
> > this document.  Is this the right thing to read, in order to figure out
> > how to spider my dynamic pages?
>
> Sorry, I was confused as I thought you wanted to index docs in a
> database without using http.  Which is it?
>
> If you want to index stuff in a database then search for the MySQL.pl
> file in the swish-e distribution.
>
>
> http://cvs.sourceforge.net/viewcvs.py/swishe/swish-e/prog-bin/MySQL.pl?rev=1.2&view=auto
>
> > Also, I am confused as to where I should direct the config file to
> > spider the dynamic links.  Let's say I want to spider this particular
> > file:
> >
> > http://www.youngcomposers.com/forum/Piano-Music-f50.html
>
> How does the spider, or anyone else for that matter, know if that's a
> static file or a dynamically generated file?
>
> > Piano-Music-f50.html is actually a php generated file with an html
> > alias, but I don't know where to direct swish-e to spider this file.
>
> I have no idea what an html alias is in that context, but you point
> the spider to the same place you would point anyone else.  To its URL.
>
> > When I spider the files under /home/yc/www/forum (my local site for
> > www.youngcomposers.com), all it does is spider the files that run the
> > forum, not the actual content dynamic pages, such as
> > "Piano-Music-f50.html" or equivalently
> > http://www.youngcomposers.com/forum/index.php?showforum=50
>
> The term "spider" implies you are spidering your web site, most likely
> with the oddly named program "spider.pl".  That would be spidering
> like google does -- by accessing your documents via the web.
>
> Please go back and look at the docs again.
>
> http://swish-e.org/docs/install.html#general_configuration_and_usage
>
> http://swish-e.org/docs/install.html#spidering_and_searching_with_a_web_form_
>
> http://swish-e.org/docs/spider.html
>
> > So I guess my basic question would be, what is the address of my dynamic
> > files?  A very poor guess is, my database files are located here:
> >
> > /var/lib/mysql/
> >
> > But is this the address to spider?  Or do I spider /home/yc/www/forum
> > instead?
>
> Maybe it's better if someone else answers that one.

Michael,

I currently use swish-e to index my MySQL database. All of the pages on my 
site are dynamically created via cgi scripts. Because of this I found that 
writing a custom perl script to query the database, build the dynamic pages, 
and then pass the output along to swish-e for indexing was the best way to go. 
It was also several orders of magnitude faster than trying to spider 2 
million+ records over http.

Swish-e has great documentation for this. Try this link:
http://swish-e.org/docs/swish-config.html#directives_for_the_prog_access_method_only

Of course you'll need to know some sort of programming language, but after you 
get a working program you could run swish-e in a manner similar to this:

/usr/local/bin/swish-e -S prog -i /path/to/your/custom/script \
    -c /path/to/swish.conf
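For illustration only, here is a minimal sketch of what such a custom script can look like, in the spirit of the prog-bin examples. The URLs, titles, and rows are made up; a real version would pull the rows from the database with DBI instead of the hard-coded list:

```perl
#!/usr/bin/perl
# Sketch of a -S prog indexing script. In a real setup @rows would come
# from the database via DBI, e.g.:
#   my $dbh = DBI->connect("dbi:mysql:forum", $user, $pass);
use strict;
use warnings;

# Stand-in for rows pulled from the database (hypothetical data).
my @rows = (
    { url   => 'http://www.example.com/forum/1.html',
      title => 'First topic',  body => 'Hello world.' },
    { url   => 'http://www.example.com/forum/2.html',
      title => 'Second topic', body => 'More text.' },
);

# Frame one document the way the prog interface expects:
# headers, a blank line, then the document itself.
sub emit_record {
    my ($url, $title, $body) = @_;
    my $doc = "<html><head><title>$title</title></head>"
            . "<body>$body</body></html>";
    return "Path-Name: $url\n"
         . "Content-Length: " . length($doc) . "\n"
         . "Document-Type: HTML*\n"
         . "\n"
         . $doc;
}

print emit_record( @{$_}{qw(url title body)} ) for @rows;
```

The only contract that matters is the header framing on stdout; how you build each document body is entirely up to your script.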

I also use only the most basic swish.conf file because you have so much more 
control of things when you use your own script.
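"Most basic" really can mean just a few lines here; a sketch of such a stripped-down swish.conf (the index path is a placeholder):

```
# Minimal swish.conf for -S prog indexing
IndexFile /path/to/index.swish-e
StoreDescription HTML* <body> 10000
MetaNames swishtitle swishdocpath
```

Since the custom script already decides what gets indexed and how each document is built, most of the crawling-related directives simply have nothing to do.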

Of course, one of the nice things about swish-e is that you can do the same 
thing several different ways... You could probably use your existing php 
scripts (the ones that generate your dynamic pages) and just write a simple 
go-between script. I did that at first for my site with a simple 2-line perl 
script.
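A hedged guess at what such a go-between looks like (the PHP command line and paths are hypothetical): it shells out to the existing page generator and wraps each page in the headers the prog interface needs.

```perl
#!/usr/bin/perl
# Hypothetical go-between for -S prog: reuse the existing PHP page
# generator and just add the framing swish-e expects.
use strict;
use warnings;

# Wrap one generated page in prog-interface framing.
sub wrap_doc {
    my ($url, $doc) = @_;
    return "Path-Name: $url\nContent-Length: " . length($doc) . "\n\n" . $doc;
}

# Feed URLs on stdin; the php command line below is made up.
while ( my $url = <STDIN> ) {
    chomp $url;
    print wrap_doc( $url, scalar `php /path/to/generate_page.php "$url"` );
}
```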

Once you get a working system then you'll need to get it going faster. I used 
SpeedyCGI to increase my speed from 12 minutes for 1500 records to 45 seconds 
for the same records. Now I do all 2 million+ records in about 10 hours and 
I'm still making it faster :-)

Good luck,

Dave
Received on Sat Nov 5 07:45:09 2005