
Re: SWISHE Perl module - index headers

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Nov 02 2001 - 17:26:26 GMT
At 08:25 AM 11/02/01 -0800, Alex Lyons wrote:
>> How does the perl module make it more portable?
>
>Probably not a big issue on most modern platforms, but it avoids having 
>to fork/exec the swish-e program: the Perl documentation describes how 
>to do this while avoiding a shell, but I haven't tried it on anything 
>other than Unix.

Yes, in my tests on Linux the speed difference was not much.  I ran an
Apache benchmark that issued a swish query under mod_perl, and the
difference between using the Perl module and using fork/exec was not really
noticeable.  Fork is very efficient now.
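
For anyone curious, the shell-avoiding idiom from perlipc is roughly this
(just a sketch; $query and $index are placeholders):

    # safe pipe open: fork a child that exec()s swish-e directly,
    # so no shell ever sees the arguments
    my $pid = open(my $fh, '-|');
    die "cannot fork: $!" unless defined $pid;
    unless ($pid) {    # child: becomes swish-e
        exec 'swish-e', '-w', $query, '-f', $index
            or die "cannot exec swish-e: $!";
    }
    print while <$fh>;    # parent: read the results
    close $fh;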

The big difference was that embedding swish into Apache made the Apache
processes use more memory.  We need a swish-e server.


>> I do hope you validate the path.
>
>Hmm... What validation does SwishOpen do?  Surely it doesn't allow a 
>shell to see the index file name?  I had a simple -r check when I was 
>only allowing a single index but I took it out in preparation for 
>allowing multiple indexes like index=file1+file2.  Perhaps I'll put it 
>back.

Oh, I was just commenting about passing user data unchecked into your
program.  Just a general comment.  
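
Something like this is all I meant (a sketch, assuming a CGI "index"
parameter; the index directory is made up):

    use CGI;

    my $q         = CGI->new;
    my $index_dir = '/usr/local/swish/indexes';   # made-up location
    my $param     = $q->param('index') || 'index.swish-e';

    # allow only simple names joined by '+'; the capture also untaints
    my ($safe) = $param =~ m/^([\w.+-]+)$/
        or die "bad index parameter\n";
    my @indexes = split /\+/, $safe;
    -r "$index_dir/$_" or die "cannot read index $_\n" for @indexes;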


>Some other comments:
>
>The TXT2 parser couldn't cope with empty files returned using the "prog" 
>method: my Perl spider returns empty files (actually Content-Length: 1 
>containing a single newline) if No-Contents: 1 is set. I had to revert 
>to TXT in this case.  The error showed up as a broken pipe, presumably 
>caused by swish-e aborting.

I just looked at something similar -- maybe it was with HTML2, but I'll
take a look again.
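
For reference, the prog output in question looks something like this
(a sketch; $url is a placeholder):

    # a "no contents" document: headers plus a one-byte body
    my $content = "\n";    # single newline, so Content-Length: 1
    print "Path-Name: $url\n",
          "Content-Length: ", length($content), "\n",
          "No-Contents: 1\n",
          "\n",
          $content;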


>It would be useful if the "prog" spider could tell swish-e what parser 
>(TXT,HTML,XML,TXT2, etc) to use for each file sent:

I like this question!  There's a Document-Type: header for exactly this --
I suppose it should be documented someplace....

>[That would save] all that IndexContents stuff in my conf file, and 
>sometimes there is no filename suffix anyway (e.g. spider-generated 
>directory indexes don't end in ".html").  How about adding a 
>"Swish-Parser:" header (or use the standard MIME "Content-Type:" if 
>you plan to eventually remove the distinction between TXT and TXT2, 
>etc., by moving completely to the libxml2 parser)?

I like Swish-Parser: since it's really explicit.  I'd like to add a
mime.types file to swish, and then assign parsers to content-types.  That
would be more in line with the rest of the world.
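
Roughly what a mime.types lookup would do in a prog program (the mapping
hash is made up, and $url is a placeholder):

    use LWP::UserAgent;

    # made-up mapping from content-type to swish parser name;
    # a real mime.types file would replace this hash
    my %parser_for = (
        'text/plain' => 'TXT',
        'text/html'  => 'HTML',
        'text/xml'   => 'XML',
    );

    my $ua       = LWP::UserAgent->new;
    my $response = $ua->get($url);
    my $parser   = $parser_for{ $response->content_type } || 'TXT';
    print "Document-Type: $parser\n";   # header the prog program emits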


>With the introduction of the libxml2 parser and the resulting increase 
>in size of the executable and/or Perl DLL (.so) (I eventually did as 
>suggested and compiled swish-e twice, with and without libxml2, but what 
>a hassle!), I would suggest that the time has probably come to split 
>swish-e into an "indexer" and a smaller "searcher" that doesn't need all 
>the parsing stuff.  In fact, the indexer probably doesn't need the 
>built-in directory/web crawling facilities now that you have the "prog" 
>method and a range of Perl spiders that seem to do the job.

I totally agree.  I don't like the -S http part much and think it's a waste
of space.  A lot of code is shared, so it would be a bit of work to figure
out which code is needed only for searching vs. indexing.

Splitting swish is more important if you are using the library, where you
might have multiple processes each with the embedded swish code.  But I'd
think under most OSes you would get copy-on-write help for the binary, so
size is not a major issue.

I wonder if there's a way to build the Perl module so that it just doesn't
load the libxml2 code.  Another option is in how we build the libswish-e.a
library: we could build two libraries, so that when you link the Perl
module you only link in the search code.
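
A Makefile.PL for the module under such a split might look like this
(the split library name is pure invention here):

    # hypothetical: assumes libswish-e were split into search and
    # index halves; "-lswish-search" is a made-up library name
    use ExtUtils::MakeMaker;
    WriteMakefile(
        NAME => 'SWISHE',
        LIBS => ['-L/usr/local/lib -lswish-search'],
        INC  => '-I/usr/local/include',
    );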

All this would be solved by a swish-e server, too, I suppose.


>Hope these comments help.

Yep.  It's good to know you are making use of the cool features of swish.
Yesterday I worked on a site that is partially static, and partially
(mostly) generated from a MySQL database.  I was able to use spider.pl to
spider the static site, and a 15-line program to index all the data in the
database.  It's a tiny site with about 500 pages that indexes in about 4
seconds....
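
The database half was nothing fancy -- roughly this shape (table, columns,
and connection details are made up):

    #!/usr/local/bin/perl -w
    use strict;
    use DBI;

    # emit each database row as a document in the "prog" format
    my $dbh = DBI->connect( 'dbi:mysql:mysite', 'user', 'pass',
                            { RaiseError => 1 } );
    my $sth = $dbh->prepare('SELECT id, title, body FROM pages');
    $sth->execute;
    while ( my ( $id, $title, $body ) = $sth->fetchrow_array ) {
        my $doc = "<html><head><title>$title</title></head>"
                . "<body>$body</body></html>";
        print "Path-Name: /page.cgi?id=$id\n",
              "Content-Length: ", length($doc), "\n",
              "\n", $doc;
    }

Then run it with something like "swish-e -S prog -i ./index_db.pl".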

BTW -- I added an undocumented feature to the spider.pl program: when
spidering, it will do a quick HEAD request for external links, so you can
do a poor man's link validation of your site as you index, if you like.
It kind of cuts down on indexing speed, though...
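
Under the hood it's just an LWP HEAD request per external link, something
like this ($url and $referer are placeholders, not the actual spider.pl
variables):

    use LWP::UserAgent;

    my $ua       = LWP::UserAgent->new;
    my $response = $ua->head($url);    # cheap check: headers only
    print STDERR "broken link on $referer: $url\n"
        unless $response->is_success;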




Bill Moseley
mailto:moseley@hank.org