Re: General comment regarding Prog method

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Jan 29 2002 - 15:25:53 GMT
At 06:31 AM 1/29/2002 -0800, Rich Thomas wrote:

>Would anyone else consider it beneficial to have the perl script examples
>made simpler?

They just don't seem simple because you don't understand them yet ;)

The SWISH-RUN documentation has this example:

    #!/usr/local/bin/perl -w
    use strict;
    
    # Build a document
    my $doc = <<EOF;
    <html>
    <head>
        <title>Document Title</title>
    </head>
        <body>
            This is the text.
        </body>
    </html>
    EOF
    
    
    # Prepare the headers for swish
    my $path = 'Example.file';
    my $size = length $doc;
    my $mtime = time;
    
    # Output the document (to swish)
    print <<EOF;
    Path-Name: $path
    Content-Length: $size
    Last-Mtime: $mtime
    Document-Type: HTML
    
    EOF
    
    # Then send the document body that Content-Length describes
    print $doc;
    
I don't know how to make it easier.  In the prog-bin directory there
is DirTree.pl, which is a -S fs replacement, and MySQL.pl which goes a bit
more into detail, and then index_hypermail.pl which does more still.  All
designed to gently introduce you to -S prog ;)
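
As a rough sketch of the same idea stretched to more than one document: the
script below reads the files named on its command line and writes each one
to swish in the header format of the example above (headers, a blank line,
then the body).  The swish_doc() helper and the script as a whole are
hypothetical, not something shipped in prog-bin, and Content-Length is taken
from length(), which assumes single-byte characters.

```perl
#!/usr/local/bin/perl -w
use strict;

# Hypothetical helper (not part of swish-e): format one document as
# headers, a blank line, and then the document body.
sub swish_doc {
    my ($path, $content, $mtime, $type) = @_;

    my $size = length $content;    # assumes single-byte characters

    my $out = "Path-Name: $path\n"
            . "Content-Length: $size\n"
            . "Last-Mtime: $mtime\n";
    $out .= "Document-Type: $type\n" if defined $type;

    return $out . "\n" . $content;
}

# Feed every file named on the command line to swish.
for my $file (@ARGV) {
    my $fh;
    unless (open $fh, '<', $file) {
        warn "Skipping $file: $!\n";
        next;
    }
    binmode $fh;
    my $content = do { local $/; <$fh> };    # slurp the whole file
    close $fh;

    my $mtime = (stat $file)[9];             # last modification time
    print swish_doc($file, $content, $mtime);
}
```

Run it by hand with a couple of file names first and look at what it prints.
Once that looks right, name the script in IndexDir and pass the file names
through SwishProgParameters.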

>I know that a lot of effort went into putting so many
>parameters and options into things like spider.pl but unless you're a perl
>programmer it's tough to know just what portions you need for which tasks.

Well, spider is complicated.  Spidering is complicated.  Here's the rule
the new swish uses:  Start with no config file, and add settings as
you need them.  Not the other way around.

So for spider.pl, start with:

IndexDir ./spider.pl
SwishProgParameters default http://mysite.com

My other suggestion:  When you decide you want to use a config file for
spider.pl, grab the SwishSpiderConfig.pl file, and then TRIM it down to
just the config section -- delete all the comments, and the examples you
are not using.
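
A trimmed-down spider config might end up looking something like the sketch
below.  The URL and email address are placeholders, and the max_files and
test_url settings are written from memory of the spider.pl documentation, so
treat them as assumptions and check each one against perldoc spider.pl for
your version before relying on it.

```perl
# spider.config -- a trimmed-down sketch, not the shipped sample file
@servers = (
    {
        base_url  => 'http://mysite.com/',     # placeholder: where to start
        email     => 'me@mysite.com',          # placeholder: contact address
        max_files => 50,                       # assumed option: stop after 50 files

        # Assumed callback: return false to skip a URL before it is fetched.
        # Here it skips common image files.
        test_url  => sub {
            my ($uri) = @_;
            return $uri->path !~ /\.(?:gif|jpe?g|png)$/i;
        },
    },
);

1;    # harmless; ends the file with a true value
```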


>What would be helpful, at least to me, would be some short simple examples
>of spider.pl or swish.conf files to do things like spider a site.

Did you look at the conf directory?  The swish-e docs are not perfect.  But
the README should point to the INSTALL doc.  Everyone reads INSTALL to get
swish installed, right?  After installing, the next two sections in that
INSTALL doc are How to get Help and Examples of use, and the latter points
to the conf directory for more examples.  The examples in conf are supposed
to guide you from very simple indexing through spidering.


>I'd like
>to know what the minimum entries in the perl script would be to do a simple
>spidering.  Then after I'm comfortable that it works I could explore the
>rest of the customization options.

Good idea.  Back to the documentation:

perldoc spider.pl

NAME
       spider.pl - Example Perl program to spider web servers

SYNOPSIS
         swish.config:
           IndexDir ./spider.pl
           SwishProgParameters spider.config
           # other swish-e settings

         spider.config:
           @servers = (
               {
                   base_url    => 'http://myserver.com/',
                   email       => 'me@myself.com',
                   # other spider settings described below
               },
           );

         begin indexing:
           swish-e -S prog -c swish.config

But if that's too much, later in the docs it says:


       If all that sounds confusing, then you can run the spider
       with default settings.  In fact, you can run the spider
       without using swish just to make sure it works.  Just run

           ./spider.pl default http://someserver.com/sometestdoc.html

I will be the first to argue that the docs are not great, but they are
there, as well as the list archives which might show other examples.

I also totally agree that this stuff is not easy at the start.  But after a
while it will seem easy.

>What would also be great is an up to date list of what perl versions and
>modules are needed for swish-e.  I guess with all the platforms supported that
>may be difficult but perhaps just the major platforms?

That's easy.  The current ones.  ;)

This is an old debate.

What can I say.  The spider is written in perl.  Most machines have perl,
but many machines have perl and modules that are years old.  System admins
should keep their machines up to date ;).  I let my DNS software get old
and my machine got hacked.  Have to stay on top of these things....



My best advice for anyone developing web sites:  Install a current
distribution of Linux on your own machine and worry about the target host
later.



Bill Moseley
mailto:moseley@hank.org
Received on Tue Jan 29 15:29:51 2002