Re: spider a database

From: Michael Porcaro <music(at)not-real.recordhall.com>
Date: Sat Nov 05 2005 - 04:07:43 GMT
When I use this command to spider my site,

Swish-e -S http -I http://www.youngcomposers.com

It takes a while to spider.  I think I would have to wait about a month
for it to finish everything at that rate.  It seems to print a neater
temp file though, but there seems to be no way to configure this
(for example, I can't seem to use a swish.conf file).

Yet, when I use this command

Swish-e -S prog -c swish.conf

Where swish.conf equals:

    IndexDir spider.pl
    IndexOnly .html
    SwishProgParameters default http://www.youngcomposers.com
    Metanames swishtitle swishdocpath
    StoreDescription TXT* 10000
    StoreDescription HTML* <body> 10000
    FuzzyIndexingMode Stemming_en

I can configure it, but it seems to print out garbage in the temp files,
and the temp files seem to blow up.  It also seems to take a while to
index.

Now you mentioned that swish-e -S http -I http://www.mysite.com is
deprecated, but it is better to use than the following method.  I am
not quite sure I follow.  What is the common way to spider a site?  I'm
confused which method to use.  By the way, I was confused when I said I
wanted to spider a database.  Both the methods I mention seem to spider
my whole site.

How long does it typically take to spider a site that has about 90,000
pages?
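
As a rough way to reason about that question: for a spider that fetches one page at a time, total time is approximately the page count multiplied by the per-request delay plus the fetch time. The sketch below is only a back-of-the-envelope estimate. The function name and the delay values (1, 5 and 60 seconds) are illustrative assumptions, not settings confirmed for either method in this thread.

```python
SECONDS_PER_DAY = 86_400


def estimated_crawl_days(pages, delay_sec, fetch_sec=0.0):
    """Rough wall-clock estimate for a sequential, one-request-at-a-time crawl."""
    total_seconds = pages * (delay_sec + fetch_sec)
    return total_seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # Hypothetical delays between requests, in seconds.
    for delay in (1, 5, 60):
        days = estimated_crawl_days(90_000, delay)
        print(f"{delay:>2} s between requests: about {days:.1f} days")
```

With these assumed numbers the delay between requests dominates the estimate, so the configured delay is the first thing to check when a crawl of this size looks like it will take weeks.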

-----Original Message-----
From: swish-e@sunsite3.berkeley.edu
[mailto:swish-e@sunsite3.berkeley.edu] On Behalf Of Bill Moseley
Sent: Friday, November 04, 2005 3:28 PM
To: Multiple recipients of list
Subject: [SWISH-E] Re: spider a database

On Fri, Nov 04, 2005 at 03:16:27PM -0500, Michael Porcaro wrote:
> Please bear with me here and thank you for your patience.  I looked at
> your link and searched around.  By searching, I assume that swish-e can
> spider databases, I wasn't really sure about this before.  I came across
> this document.  Is this the right thing to read, in order to figure out
> how to spider my dynamic pages?

Sorry, I was confused as I thought you wanted to index docs in a
database without using http.  Which is it?

If you want to index stuff in a database then search for the MySQL.pl
file in the swish-e distribution.

 
http://cvs.sourceforge.net/viewcvs.py/swishe/swish-e/prog-bin/MySQL.pl?rev=1.2&view=auto

> Also, I am confused as to where I should direct the config file to
> spider the dynamic links.  Let's say I want to spider this particular
> file:
> 
> http://www.youngcomposers.com/forum/Piano-Music-f50.html

How does the spider, or anyone for that matter, know if that's a static
file or a dynamically generated file?

> Piano-Music-f50.html is actually a php generated file with an html
> alias, but I don't know where to direct swish-e to spider this file.

I have no idea what an html alias is in that context, but you point
the spider to the same place you would point anyone else.  To its url.


> When I spider the files under /home/yc/www/forum (my local site for
> www.youngcomposers.com), all it does is spider the files that run the
> forum, not the actual content dynamic pages, such as
> "Piano-Music-f50.html" or equivalently
> http://www.youngcomposers.com/forum/index.php?showforum=50

The term "spider" implies you are spidering your web site, most likely
with the oddly named program "spider.pl".  That would be spidering
like google does -- by accessing your documents via the web.
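
To illustrate the point, here is a minimal sketch of what a web spider does with a fetched page: it collects the links and resolves them to absolute URLs on the same host. It is not how spider.pl is implemented. The class and function names are hypothetical, and the sample HTML is made up from the URLs mentioned in this thread. Note that the rewritten ".html" link and the "index.php?showforum=50" link are treated identically, because the spider only ever sees URLs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def same_host_links(base_url, html):
    """Return the links in html that stay on the host of base_url."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    host = urlparse(base_url).netloc
    return [url for url in collector.links if urlparse(url).netloc == host]


if __name__ == "__main__":
    # Made-up page content for illustration only.
    page = (
        '<a href="Piano-Music-f50.html">Piano Music</a> '
        '<a href="index.php?showforum=50">Piano Music (raw)</a> '
        '<a href="http://example.org/">Elsewhere</a>'
    )
    for url in same_host_links("http://www.youngcomposers.com/forum/", page):
        print(url)
```

A real spider would then fetch each of those URLs over HTTP and repeat, which is why the starting point is a URL on the site rather than a directory such as /home/yc/www/forum.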

Please go back and look at the docs again.

http://swish-e.org/docs/install.html#general_configuration_and_usage

http://swish-e.org/docs/install.html#spidering_and_searching_with_a_web_form_

http://swish-e.org/docs/spider.html


> So I guess my basic question would be, what is the address of my dynamic
> files?  A very poor guess is, my database files are located here:
> 
> /var/lib/mysql/
> 
> But is this the address to spider?  Or do I spider /home/yc/www/forum
> instead?  

Maybe it's better if someone else answers that one.

-- 
Bill Moseley
moseley@hank.org

Received on Fri Nov 4 20:07:54 2005