Re: search only .html and no extension files

From: Michael Porcaro <music(at)not-real.recordhall.com>
Date: Wed Nov 09 2005 - 07:55:46 GMT
Great. After an agonizing week of researching swish-e, I think I almost
have the basics needed to spider a site, but not quite.  Let me explain
what confused me, and most likely confuses other newbies; hopefully I am
now on the right track.  You don't have to answer my rhetorical
questions or comments, maybe just a "you're on the right track" or
"you're totally off":

1.  Using a config file (-c config) doesn't seem to work well when using
the -S prog command.  I also noticed you can't run a config file and a
SwishSpiderConfig.pl at the same time.  Is this true?  There really is
no point in doing this anyway.

2.  You need the -S prog option when you choose to use spider.pl.  When
indexing with spider.pl, you don't need swish.conf (not sure if you CAN
use it, though), but you DO need SwishSpiderConfig.pl.  Use this command:
swish-e -S prog -I spider.pl
and SwishSpiderConfig.pl will be called automatically, as long as that
file is in the same directory.

3.  .swishcgi.conf was REALLY confusing me.  This file apparently isn't
used for the spidering process, it seems to be used AFTER the process is
done, to control how many searches per page, the title, etc.  Correct?

4.  SwishSpiderConfig.pl replaces a config file and is more efficient.
It is written in perl, so it has more complex coding, but more power and
control.  SwishSpiderConfig.pl is simply the config file for spider.pl.

My question now is regarding the test_url function.  Basically, I am
interested in only spidering html and non extension files.  Here is an
example of a non extension file:
http://www.youngcomposers.com/articles/History-of-Young-Composers

I tried this command in SwishSpiderConfig.pl which was in your
documentation:

test_url    => sub { $_[0]->path =~ /\.html?$/ },

But it doesn't seem to work.  It keeps reporting an error: no files were
indexed.  When I comment this line out, the spidering does work, so
there seems to be a problem with that line of code.  Any suggestions?
Are there other ways to "index only html", or is test_url the best way
to do this?
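
One possible explanation, not confirmed anywhere in this thread: the
test above accepts only paths ending in .html or .htm, so a start URL
such as http://www.youngcomposers.com/ (path "/") would itself be
rejected and nothing would ever be fetched.  A minimal sketch of a
SwishSpiderConfig.pl that accepts both .html/.htm pages and paths whose
last segment has no extension follows.  The base_url and email values
are placeholders, and the @servers layout is assumed from the spider.pl
documentation; adjust to your own setup.

```perl
# SwishSpiderConfig.pl -- sketch only; base_url and email are placeholders.
@servers = (
    {
        base_url => 'http://www.youngcomposers.com/',
        email    => 'webmaster@example.com',    # placeholder address

        # Accept a URL when the last segment of its path either ends in
        # .html/.htm or contains no dot at all (no extension).  A bare
        # "/" has an empty last segment, so the start page is accepted.
        test_url => sub {
            my $uri = shift;                          # URI object from spider.pl
            my ($last) = $uri->path =~ m{([^/]*)$};   # text after the final "/"
            return $last =~ /\.html?$/i || $last !~ /\./;
        },
    },
);

1;
```

With this test, /articles/History-of-Young-Composers and /index.html
would pass, while /images/logo.gif would be skipped.  Note that a page
rejected by test_url is never fetched, so links that appear only on
rejected pages will not be followed either.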

-----Original Message-----
From: swish-e@sunsite3.berkeley.edu
[mailto:swish-e@sunsite3.berkeley.edu] On Behalf Of Bill Moseley
Sent: Tuesday, November 08, 2005 11:44 PM
To: Multiple recipients of list
Subject: [SWISH-E] Re: search only .html and no extension files

On Tue, Nov 08, 2005 at 12:49:31PM -0800, Michael Porcaro wrote:
> Hi,
> 
> Question 1:  
> Let's say I add a new page.  Do I have to spider the whole site again
> to index the 1 page?

Mostly, yes.


> 
> Question 2:
> I finally was able to spider my site, and get the search engine to
> work.  One problem now:
> 
> The spider indexed every single link when I instructed it to index
> html by using this config file called swish.conf
> 
> # Use spider.pl for indexing 
> IndexDir spider.pl
> IndexOnly .html

IndexOnly isn't used with the -S prog input method (i.e. when using
spider.pl).


> 
> It took about 7 hours to spider the whole site with this command:
> 
> Swish-e -e -S prog -c swish.conf
> 
> There are a lot of useless links in the index file, which is 80 megs.
> How can I filter out every page except .html?  How come it didn't obey
> the config file?

http://swish-e.org/docs/spider.html should cover most of that.

-- 
Bill Moseley
moseley@hank.org

Received on Tue Nov 8 23:55:47 2005