
Re: spider a database

From: Michael Porcaro <music(at)not-real.recordhall.com>
Date: Sat Nov 05 2005 - 19:11:28 GMT
Ok, I think I am understanding now.  I was confused because I didn't
realize there are 2 different configuration files: one for parameters,
which is much simpler (swish.conf), and another for spider.pl, which
requires perl knowledge (a perl config file).  So there are 2 config
files, am I correct on this?

Finally, where is this custom perl config file supposed to go?  Under
what directory?  I tried running it in my cgi-bin (local website) but it
didn't work.  I am guessing the script has to go somewhere else, unlike
the regular swish.conf file (which I could just keep in my cgi-bin).
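
For what it's worth, the perl config file can live anywhere swish-e can
read, because spider.pl is pointed at it explicitly: you replace the
"default http://..." arguments in SwishProgParameters with the file's
path.  The file itself is ordinary perl that defines an @servers list; a
minimal sketch with placeholder values (option names are from the
spider.pl docs -- check the version you have):

    # sketch of a spider.pl config file -- all values are placeholders
    @servers = (
        {
            base_url  => 'http://www.youngcomposers.com/',
            email     => 'you@example.com',   # contact address for the spider
            delay_sec => 0,                   # seconds to wait between requests
        },
    );
    1;   # the config file must return a true value

Then swish.conf would say "SwishProgParameters /path/to/spider_config.pl"
instead of "SwishProgParameters default http://...".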

-----Original Message-----
From: swish-e@sunsite3.berkeley.edu
[mailto:swish-e@sunsite3.berkeley.edu] On Behalf Of Dave Hane
Sent: Saturday, November 05, 2005 10:43 AM
To: Multiple recipients of list
Subject: [SWISH-E] Re: spider a database

On Friday 04 November 2005 21:07, Michael Porcaro wrote:
> When I use this command to spider my site,
>
> swish-e -S http -i http://www.youngcomposers.com
>
> It takes a while to spider.  I think I would have to wait about a month
> for it to finish everything at that rate.  It seems to print a neater
> temp file though, but there seems to be no way to configure this (for
> example, I can't seem to use a swish.conf file)
>
> Yet, when I use this command
>
> swish-e -S prog -c swish.conf
>
> Where swish.conf equals:
>
>     IndexDir spider.pl
>     IndexOnly .html
>     SwishProgParameters default http://www.youngcomposers.com
>     Metanames swishtitle swishdocpath
>     StoreDescription TXT* 10000
>     StoreDescription HTML* <body> 10000
>     FuzzyIndexingMode Stemming_en
>
> I can configure it, but it seems to print out garbage in the temp
> files, and the temp files seem to blow up.  It also seems to take a
> while to index.
>
> Now you mentioned that swish-e -S http -i http://www.mysite.com is
> deprecated, but it is better to use than the following method.  I am
> not quite sure I follow.  What is the common way to spider a site?
> I'm confused which method to use.  By the way, I was confused when I
> said I wanted to spider a database.  Both the methods I mention seem
> to spider my whole site.
>
> How long does it typically take to spider a site that has about 90,000
> pages?
>
> -----Original Message-----
> From: swish-e@sunsite3.berkeley.edu
> [mailto:swish-e@sunsite3.berkeley.edu] On Behalf Of Bill Moseley
> Sent: Friday, November 04, 2005 3:28 PM
> To: Multiple recipients of list
> Subject: [SWISH-E] Re: spider a database
>
> On Fri, Nov 04, 2005 at 03:16:27PM -0500, Michael Porcaro wrote:
> > Please bear with me here and thank you for your patience.  I looked
> > at your link and searched around.  By searching, I assume that
> > swish-e can spider databases, I wasn't really sure about this
> > before.  I came across this document.  Is this the right thing to
> > read, in order to figure out how to spider my dynamic pages?
>
> Sorry, I was confused as I thought you wanted to index docs in a
> database without using http.  Which is it?
>
> If you want to index stuff in a database then search for the MySQL.pl
> file in the swish-e distribution.
>
>
> http://cvs.sourceforge.net/viewcvs.py/swishe/swish-e/prog-bin/MySQL.pl?rev=1.2&view=auto
>
> > Also, I am confused as to where I should direct the config file to
> > spider the dynamic links.  Let's say I want to spider this
> > particular file:
> >
> > http://www.youngcomposers.com/forum/Piano-Music-f50.html
>
> How does the spider, or anyone else for that matter, know if that's a
> static file or a dynamically generated file?
>
> > Piano-Music-f50.html is actually a php generated file with an html
> > alias, but I don't know where to direct swish-e to spider this file.
>
> I have no idea what an html alias is in that context, but you point
> the spider to the same place you would point anyone else: to its URL.
>
> > When I spider the files under /home/yc/www/forum (my local site for
> > www.youngcomposers.com), all it does is spider the files that run
> > the forum, not the actual content dynamic pages, such as
> > "Piano-Music-f50.html" or equivalently
> > http://www.youngcomposers.com/forum/index.php?showforum=50
>
> The term "spider" implies you are spidering your web site, most likely
> with the oddly named program "spider.pl".  That would be spidering
> like google does -- by accessing your documents via the web.
>
> Please go back and look at the docs again.
>
> http://swish-e.org/docs/install.html#general_configuration_and_usage
>
>
> http://swish-e.org/docs/install.html#spidering_and_searching_with_a_web_form_
>
> http://swish-e.org/docs/spider.html
>
> > So I guess my basic question would be, what is the address of my
> > dynamic files?  A very poor guess is, my database files are located
> > here:
> >
> > /var/lib/mysql/
> >
> > But is this the address to spider?  Or do I spider
> > /home/yc/www/forum instead?
>
> Maybe it's better if someone else answers that one.

Michael,

I currently use swish-e to spider my mysql database. All of the pages on
my site are dynamically created via cgi scripts. Because of this I found
that writing a custom perl script to query the database, build the
dynamic pages, and then pass the output along to swish for indexing was
the best way to go. Not to mention that it was also several orders of
magnitude faster than trying to spider 2 million+ records.
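
A skeleton of that kind of script, just to make the shape concrete (a
sketch, not Dave's actual code -- the table, columns, and URL are made
up; the headers are the ones the prog method expects):

    #!/usr/bin/perl
    # Sketch: query the database, build one page per row, and hand each
    # page to swish-e on stdout using the -S prog header format.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'DBI:mysql:database=forum;host=localhost',
                            'user', 'password', { RaiseError => 1 } );
    my $sth = $dbh->prepare('SELECT id, title, body FROM posts');
    $sth->execute;

    while ( my ( $id, $title, $body ) = $sth->fetchrow_array ) {
        my $doc = "<html><head><title>$title</title></head>"
                . "<body>$body</body></html>";

        # prog method: headers, a blank line, then the document itself
        print "Path-Name: http://www.youngcomposers.com/forum/index.php?showtopic=$id\n",
              "Content-Length: ", length($doc), "\n",
              "Document-Type: HTML*\n\n",
              $doc;
    }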

Swish-e has great documentation for this. Try this link:
http://swish-e.org/docs/swish-config.html#directives_for_the_prog_access_method_only

Of course you'll need to know some sort of programming language, but
after you get a working program you could run swish-e in a manner
similar to this:

/usr/local/bin/swish-e -S prog -i /path/to/your/custom/script -c /path/to/swish.conf

I also use only the most basic swish.conf file because you have so much
more control of things when you use your own script.
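
Something along these lines is usually enough (a sketch; the paths are
placeholders):

    # minimal swish.conf for the prog method
    IndexFile /path/to/index.swish-e
    MetaNames swishtitle swishdocpath
    StoreDescription HTML* <body> 10000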

Of course, one of the nice things about swish-e is that you can do the
same thing several different ways... You could probably use your
existing php scripts (the ones that generate your dynamic pages) and
just write a simple go-between script. I did that at first for my site
with a simple 2-line perl script.
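
A go-between like that might look roughly like this (a guess at the
shape, not Dave's actual script -- it assumes the php script will take
its parameter on the command line, and the id range is invented):

    #!/usr/bin/perl
    # Sketch: run the existing php page generator and wrap its output
    # in the -S prog headers for swish-e.
    use strict;
    use warnings;

    for my $id ( 1 .. 100 ) {
        my $doc = `php /home/yc/www/forum/index.php showforum=$id`;
        next unless defined $doc and length $doc;
        print "Path-Name: http://www.youngcomposers.com/forum/index.php?showforum=$id\n",
              "Content-Length: ", length($doc), "\n",
              "Document-Type: HTML*\n\n",
              $doc;
    }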

Once you get a working system then you'll need to get it going faster.
I used SpeedyCGI to increase my speed from 12 minutes for 1500 records
to 45 seconds for the same records. Now I do all 2 million+ records in
about 10 hours and I'm still making it faster :-)

Good luck,

Dave
Received on Sat Nov 5 11:11:35 2005