
Re: search only .html and no extension files

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Nov 09 2005 - 15:14:55 GMT
On Wed, Nov 09, 2005 at 02:53:13AM -0500, Michael Porcaro wrote:
> Great, after an agonizing week of research of swish-e, I think I almost
> have the basics to spider a site, but not quite yet.  Let me just
> explain what was confusing for me, and most likely other newbies, and
> hopefully I am now on the right track.  You don't have to answer my
> rhetorical questions or comments, maybe just a "you're on the right
> track, or you're totally off":

I think, in general, using swish-e is a bit confusing for new users
because there's more than one way to deal with things.  It's clearly
not a plug-n-play program, but that's one thing that makes it more
useful.  Maybe I'm wrong, but it didn't sound like you spent much
time with the documentation, since you asked a few questions that
are clearly answered there.

There's no doubt that the documentation could be improved.  Plus
there's a lot of it to try and read all at once.


> 1.  Using a config file (-c config) doesn't seem to work well when using
> the -S prog command.  I also noticed you can't run a config file and a
> SwishSpiderConfig.pl at the same time.  Is this true?  There really is
> no point in doing this anyway.

I have not heard that complaint much.  Once people understand that
indexing and spidering are two different tasks, it makes sense.

    http://swish-e.org/docs/swish-faq.html#can_i_index_documents_on_a_web_server_
    http://swish-e.org/docs/swish-faq.html
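
You can even see the split on the command line (a sketch; "default"
tells spider.pl to use its built-in settings):

    # spidering only: fetch pages and dump them to a file
    ./spider.pl default http://www.youngcomposers.com/ > out.txt

    # spidering feeding straight into indexing
    swish-e -S prog -i ./spider.pl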

And this page has a breakdown of the config options used for the
various input methods:

    http://swish-e.org/docs/swish-config.html

    # Directives for the File Access method only
    # Directives for the HTTP Access Method Only
    # Directives for the prog Access Method Only

Plus, http://swish-e.org/docs/install.html#general_configuration_and_usage
really does walk you through everything, from basic indexing of files
on a hard drive to spidering, with config examples along the way.
What part of that was confusing?
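
For example, the most basic case from that walk-through, indexing
files already on disk with no spider at all, needs only a couple of
directives (a sketch; the paths are made up):

    # swish.conf for the file access method
    IndexDir  /home/you/htdocs
    IndexOnly .html .htm
    IndexFile ./index.swish-e

and then just "swish-e -c swish.conf".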


> 2.  You need the -S prog command when you choose to use spider.pl.  When
> indexing with spider.pl, you don't need swish.conf, (not sure if you CAN
> use it though) you DO need SwishSpiderConfig.pl.  Use this command:
> swish-e -S prog -I spider.pl and SwishSpiderConfig.pl will be called
> automatically, as long as that file is in the same directory.

Well, the example in the INSTALL file shows using spider.pl with a
swish config file.

    http://swish-e.org/docs/install.html#general_configuration_and_usage

    "~/web_index$ swish-e -S prog -c swish.conf"

You don't *need* one, but you can use one.

You don't *need* SwishSpiderConfig.pl, either.  That's just the
default config the spider looks for if you don't specify one.
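
For example, a swish config can name the spider program and hand it
your own spider config via SwishProgParameters (a sketch; the
my_spider.config name is made up):

    # swish.conf fragment for -S prog
    IndexDir            ./spider.pl
    SwishProgParameters my_spider.config

Then "swish-e -S prog -c swish.conf" runs spider.pl with
my_spider.config instead of the default SwishSpiderConfig.pl.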


> 3.  .swishcgi.conf was REALLY confusing me.  This file apparently isn't
> used for the spidering process, it seems to be used AFTER the process is
> done, to control how many searches per page, the title, etc.  Correct?

Correct.  It's the conf file for swish.cgi.
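
If it helps, .swishcgi.conf is just Perl that returns a hash
reference (a sketch; the values here are only examples):

    # .swishcgi.conf -- read by swish.cgi at search time,
    # not during spidering or indexing
    return {
        title     => 'Search my site',
        page_size => 20,
    };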


> 4.  SwishSpiderConfig.pl replaces a config file and is more efficient.
> It is written in perl, so it has more complex coding, but more power and
> control.  SwishSpiderConfig.pl is simply the config file for spider.pl.

Correct again.  It was named to indicate exactly that.


> My question now is regarding the test_url function.  Basically, I am
> interested in only spidering html and non extension files.  Here is an
> example of a non extension file:
> http://www.youngcomposers.com/articles/History-of-Young-Composers
> 
> I tried this command in SwishSpiderConfig.pl which was in your
> documentation:
> 
> test_url    => sub { $_[0]->path =~ /\.html?$/ },
> 
> But it doesn't seem to work.  It keeps saying error, no files were
> indexed.  When I comment this file out, the spidering does work, so
> there seems to be a problem with that line of code.  Any suggestions?
> Are there other ways to "index only html" or is test_url the best way to
> do this?

This is *clearly* documented here:

    http://swish-e.org/docs/spider.html#callback_functions


Think about how a spider might work.  It fetches an initial web page,
indexes its content, then looks at all the links in that document.
You might want to follow some of those links (say, to html files) but
not others (say, image files).

How do you know which is which before you actually fetch the file?
You don't.  But, if it's your own site, you can make a guess by
looking at the file name.

test_url is assigned a subroutine that returns true if it's OK to
fetch the file, and false if it's not.

    test_url    => sub { $_[0]->path =~ /\.html?$/ },

creates a subroutine that returns true if the URL's path ends in
.html or .htm.  Do any of your files match that?
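
To make that concrete, here's how a few paths fare against that
regex (your URL from above, plus some made-up ones):

    /articles/History-of-Young-Composers   # false (no extension)
    /index.html                            # true
    /faq.htm                               # true
    /images/logo.gif                       # false

That's why you got the "no files were indexed" error: nothing on
your site got past the test.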

What about files that don't have an extension?  Well, that's a bit
trickier.

    test_url => sub {
        my $url = shift;
        return 1 if $url->path =~ /\.html?$/;  # .html or .htm

        # will this work on your site?
        # assume any remaining path with a dot is not html:
        return 1 unless $url->path =~ /\./;
        return 0;
    },

Or maybe look at just the last part of the path?
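
Something like this, for example (a sketch in the same spirit: only
the final path segment is checked for a dot, so a dot in a directory
name doesn't rule the page out):

    test_url => sub {
        my $url = shift;
        return 1 if $url->path =~ /\.html?$/;   # .html or .htm

        # everything after the last slash:
        my ($last) = $url->path =~ m{([^/]*)$};

        # no dot in the last segment, so assume it's html:
        return $last !~ /\./;
    },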

A slower, but better, way is to skip test_url and use test_response
instead.  It's slower because each document has to be requested
before you can reject it, but better because the server tells you
the actual content type.  Again, this is right out of the docs, so
I'm not sure how you missed it:

    test_response => sub {
        # $_[2] is the HTTP::Response object for this request
        my $content_type = $_[2]->content_type;
        return $content_type =~ m!text/html!;
    },
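
In case it isn't clear where these callbacks live: they go inside
the hashes in the @servers list in your spider config, alongside
base_url and email.  Roughly (your URL from above; the email is a
placeholder):

    @servers = (
        {
            base_url      => 'http://www.youngcomposers.com/',
            email         => 'you@example.com',
            test_response => sub {
                return $_[2]->content_type =~ m!text/html!;
            },
        },
    );
    1;   # a spider config file must return true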

-- 
Bill Moseley
moseley@hank.org
