The documentation is not clear on this point, but you should not use -S http.
Instead, use the spider.pl script with the -S prog option; it gives you much
better performance and control. See the spider.pl perldoc.
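
As a rough sketch of that setup (assuming spider.pl sits in the current
directory and the config file is named swish.conf, as in your message;
adjust the paths for your install, and check the perldoc for the details):

```
# swish.conf -- run with: swish-e -S prog -c swish.conf
# With -S prog, IndexDir names the program swish-e runs to fetch documents.
# The ./ path is an assumption about where spider.pl lives.
IndexDir ./spider.pl
# "default" asks spider.pl to use its built-in settings for the given URL.
SwishProgParameters default http://www.youngcomposers.com/
```

Once that works, the "default" keyword can be swapped for the name of a
spider config file if you need finer control over what gets fetched.
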
Michael Porcaro scribbled on 11/4/05 10:05 PM:
> When I use this command to spider my site,
>
> Swish-e -S http -I http://www.youngcomposers.com
>
> It takes a while to spider. I think I would have to wait about a month
> for it to finish everything at that rate. It seems to print a neater
> temp file though, but there seems to be no way to configure this
> (for example, I can't seem to use a swish.conf file).
>
> Yet, when I use this command
>
> Swish-e -S -c swish.conf
>
> Where swish.conf equals:
>
> IndexDir spider.pl
> IndexOnly .html
> SwishProgParameters default http://www.youngcomposers.com
> Metanames swishtitle swishdocpath
> StoreDescription TXT* 10000
> StoreDescription HTML* <body> 10000
> FuzzyIndexingMode Stemming_en
>
> I can configure it, but it seems to print out garbage in the temp files,
> and the temp files seem to blow up. It also seems to take a while to
> index.
>
> Now you mentioned that swish-e -S http -I http://www.mysite.com is
> deprecated, but it is better to use than the following method. I am
> not quite sure I follow. What is the common way to spider a site? I'm
> confused which method to use. By the way, I was confused when I said I
> wanted to spider a database. Both the methods I mention seem to spider
> my whole site.
>
> How long does it typically take to spider a site that has about 90,000
> pages?
>
> -----Original Message-----
> From: swish-e@sunsite3.berkeley.edu
> [mailto:swish-e@sunsite3.berkeley.edu] On Behalf Of Bill Moseley
> Sent: Friday, November 04, 2005 3:28 PM
> To: Multiple recipients of list
> Subject: [SWISH-E] Re: spider a database
>
> On Fri, Nov 04, 2005 at 03:16:27PM -0500, Michael Porcaro wrote:
>
>>Please bear with me here and thank you for your patience. I looked at
>>your link and searched around. By searching, I assume that swish-e can
>>spider databases, I wasn't really sure about this before. I came across
>>this document. Is this the right thing to read, in order to figure out
>>how to spider my dynamic pages?
>
>
> Sorry, I was confused as I thought you wanted to index docs in a
> database without using http. Which is it?
>
> If you want to index stuff in a database then search for the MySQL.pl
> file in the swish-e distribution.
>
>
> http://cvs.sourceforge.net/viewcvs.py/swishe/swish-e/prog-bin/MySQL.pl?rev=1.2&view=auto
>
>
>>Also, I am confused as to where I should direct the config file to
>>spider the dynamic links. Let's say I want to spider this particular
>>file:
>>
>>http://www.youngcomposers.com/forum/Piano-Music-f50.html
>
>
> How does the spider, or anyone for that matter, know if that's a static
> file or a dynamically generated file?
>
>
>>Piano-Music-f50.html is actually a php generated file with an html
>>alias, but I don't know where to direct swish-e to spider this file.
>
>
> I have no idea what an html alias is in that context, but you point
> the spider to the same place you would point anyone else. To its url.
>
>
>
>>When I spider the files under /home/yc/www/forum (my local site for
>>www.youngcomposers.com), all it does is spider the files that run the
>>forum, not the actual content dynamic pages, such as
>>"Piano-Music-f50.html" or equivalently
>>http://www.youngcomposers.com/forum/index.php?showforum=50
>
>
> The term "spider" implies you are spidering your web site, most likely
> with the oddly named program "spider.pl". That would be spidering
> like google does -- by accessing your documents via the web.
>
> Please go back and look at the docs again.
>
> http://swish-e.org/docs/install.html#general_configuration_and_usage
>
> http://swish-e.org/docs/install.html#spidering_and_searching_with_a_web_form_
>
> http://swish-e.org/docs/spider.html
>
>
>
>>So I guess my basic question would be, what is the address of my dynamic
>>files? A very poor guess is, my database files are located here:
>>
>>/var/lib/mysql/
>>
>>But is this the address to spider? Or do I spider /home/yc/www/forum
>>instead?
>
>
> Maybe better if someone else answers that one.
>
--
Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
Received on Fri Nov 4 20:09:56 2005