Re: Ignore Question

From: Gentile, Jeff <GentileJ(at)>
Date: Mon Feb 24 2003 - 15:26:19 GMT
Here is the -S prog code (only doing the technote dir at the moment). Notice that on Friday you had recommended I use a splice to skip my 8-line header; however, I was getting errors with the "8" being beyond the array bounds with that method, and rather than fix it I've gone back to my more primitive method. Also note that the "grep" in the filelist is to avoid the "." and "..", and to only look at files with a "txt" or "html" file extension. The thing that stinks here is I can't use the byte count of the file, since the header, although it will always be 8 lines, can vary in its character size. Note I'm running on Linux, so binmode is already there. I suppose I could copy the contents into a new file and index that, but that presents other problems, and obviously I'd like to avoid that. Thanks.


#!/usr/bin/perl -w

use strict;
use Cwd;
use File::Basename;    # provides dirname(), used below
use File::stat;
use Fcntl ':flock';

my $cgidir = cwd();
my $rootdir = dirname($cgidir);
my $kbdir = "$rootdir/html/kb";
my $techdir = "$kbdir/tech";

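sub tech_print; sub get_tech;

# Walk the technote directory and emit each document.
get_tech;

sub get_tech {
    # Parenthesize the arguments so "|| die" tests opendir, not $techdir.
    opendir(DH, $techdir) || die "Open $techdir for read failed! $!";
    my @filelist = grep !/^\.\.?$/ && (/\.html?$/ || /\.txt$/), readdir DH;
    closedir (DH) || die "Close $techdir failed! $!";
    foreach (@filelist) {
        $_ = "$techdir/$_";
        tech_print($_);
    }
}
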
sub tech_print {

    my $filename = shift;
    my $doctype;
    my $docsize = '';
    if ($filename =~ /\.html?$/) {
        $doctype = 'HTML*';
    } else {
        $doctype = 'TXT*';
    }
    open (FH, "<$filename") || die "Cannot open $filename for read $!\n";
    flock (FH, LOCK_SH) || die "Cannot get shared lock for $filename $!\n";
    my $mtime = stat($filename)->mtime;
    my @line=<FH>;
    close (FH) || die "Cannot close $filename $!\n";
    # Drop the 8-line header.
    for (1..8) {
        shift @line;
    }
    # Concatenate the remaining lines so their total length can be reported.
    foreach (@line) {
        $docsize .= $_;
    }

    #my $docsize = join "\n", splice( @line, 8 );
    my $size = length $docsize;

# Headers for swish-e, then a blank line, then the document body.
print <<EOF;
Path-Name: $filename
Content-Length: $size
Last-Mtime: $mtime
Document-Type: $doctype

EOF
    print @line;
}
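
For reference, the splice approach mentioned above can probably be made to work by guarding against files shorter than the header. This is only a sketch: `body_after_header` is a hypothetical helper name, not part of the script above, and the sample lines are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines that follow an $n-line header.
# Returns an empty list when the file has no lines beyond the header,
# so splice() is never asked for an offset past the end of the array.
sub body_after_header {
    my ($n, @lines) = @_;
    return () if @lines <= $n;
    return splice(@lines, $n);    # removes and returns elements $n..end
}

# Made-up input: ten lines, of which the first eight are "header".
my @lines = map { "line $_\n" } 1 .. 10;
my @body  = body_after_header(8, @lines);

# Lines read with <FH> keep their trailing newline, so join on ''
# rather than "\n" to avoid inflating the length.
my $content = join '', @body;
print "Content-Length: ", length($content), "\n";
print $content;
```

Joining on the empty string matters here: joining on "\n" would add one extra character per line and make the reported length larger than what is actually printed.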

-----Original Message-----
From: Bill Moseley []
Sent: Monday, February 24, 2003 09:45
To: Gentile, Jeff
Cc: Multiple recipients of list
Subject: RE: [SWISH-E] Re: Ignore Question

On Mon, 24 Feb 2003, Gentile, Jeff wrote:

> Thanks... I figured out what's going on, I was indexing my script... LOL! as
> I hadn't added the -S prog switch... thinking that I could have one config
> file index both a file system and a prog.... 

I've wanted for a long time to change -i and IndexDir to accept some fake
URLs as in:


Then they could all be in a single config file.

You can somewhat use as a replacement for -S fs, so you can
have a single -S prog that reads the file system, runs the spider, and
indexes a database all in one program.  (I index one site that's part
static pages which I index with and part database which I index
with the MySQL script.) 

> Is "-T indexed_words" an undocumented feature? It's great!

No, it's documented on the SWISH-RUN man page, but it only says it exists
and not what all the options are (-T help shows what's available).  It's
really there for helping with development and not expected to be part of the
normal user interface, although I use it to extract the words from the
index for use in a spell checker.

> However, now my content-lengths are off, I think due to some of the odd characters
> in the tech notes... there isn't some undocumented addition of an "EOF" char string
> type feature, is there?

No there isn't.  Might be a good addition.  

Perl's "length" counts characters, whereas swish-e expects a byte
count.  I have not looked at what might happen if the string in Perl contains
multi-byte chars.  Perhaps that's what is happening in your case.  It's
more common to simply count the wrong number of bytes, or to not use binmode,
so that lengths are counted incorrectly.
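
To illustrate the character-versus-byte distinction, here is a minimal sketch using the core Encode module. It assumes the text is UTF-8, which is only an assumption for the example; the sample string is made up.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw bytes, as they would arrive from a file read without an
# encoding layer: "cafe" with an accented e, encoded as UTF-8.
my $raw = "caf\xC3\xA9\n";

# The same text decoded into a Perl character string.
my $chars = decode('UTF-8', $raw);

my $byte_count = length($raw);      # counts bytes: 6
my $char_count = length($chars);    # counts characters: 5

# Encoding the character string again recovers the byte count.
my $reencoded_count = length(encode('UTF-8', $chars));

print "bytes: $byte_count\n";
print "chars: $char_count\n";
print "re-encoded bytes: $reencoded_count\n";
```

If the program never decodes its input, length already returns bytes; the mismatch only appears once the data has been turned into a character string.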

Can you post your -S prog code?
Received on Mon Feb 24 15:28:13 2003