[swish-e] indexing with DirTree.pl - help needed

From: mattack <paintitmatt(at)not-real.gmail.com>
Date: Fri Dec 14 2007 - 22:45:41 GMT
I'm trying to set up indexed searching of a file server running Debian
Etch. I'm using swish-e with DirTree.pl and accessing the index with
one of the example CGI scripts, swish.cgi.
I have the web interface working, but I'm having trouble with the indexing...
There are all kinds of files on the file server: .doc, .txt, .pdf,
.iso, .rtf, .xls, image files, .mp3, OpenOffice files, and all kinds
of other stuff (sounds, movies... you name it, it's probably on this
server). I made a test directory, copied some test documents into it,
and started indexing (see command below).
Generally I'm happy. I can index the contents of lots of files,
including Word, Excel, and PDF files. It's pretty cool I think.

Here are my problems and questions.
* I'm really confused by the documentation. It assumes a lot of
knowledge that I don't have and seems scattered.
* swish-e indexes hidden directories and files even though I added the
example code to not index them into DirTree.pl. How can I stop this?
* I'd like the name of the file to show up in searches by "Title &
Body" in swish.cgi even if swish-e doesn't know how to filter the
contents, including text documents with no extension. This is not
happening. What can I do to make this happen?
* Is there a way to index OpenOffice.org files, both spreadsheet and
word processor documents? Can someone point me in a direction to look?

I'll probably think of more later...
Thanks, in advance.
-Matt


# uname -a
Linux neuliver 2.6.18-4-686 #1 SMP Mon Mar 26 17:17:36 UTC 2007 i686 GNU/Linux


# swish-e -V
SWISH-E 2.4.3


# less swish-e.conf
IndexName "the Liver"
IndexDescription "This is an index of files on the Liver."
IndexAdmin root
IndexFile /etc/swish-e/index.liver
IndexDir DirTree.pl
SwishProgParameters /var/local/testsearch #-no_skip
Metanames swishtitle swishdocpath
StoreDescription TXT* 10000
StoreDescription HTML* <body> 1000
IndexContents TXT* .txt .log .txt .rtf
# remove doc-root path so links will work on the results page
ReplaceRules remove /var/local/


User config section of DirTree.pl

#--------------- User Configuration Section ------------------------
# Regular expression that says these files are text
# even though SWISH::Filter thinks they might be binary

my @not_binary_extensions = qw/
    .pl
    .pm
    .c
    .conf
    rc
/;


# Subroutine to validate file names: return true if file is ok to process
# or false to skip the file.
# The first parameter passed in is the

sub check_path {
    my $path = shift;
    return if $path =~ /\.htaccess$/;  # don't index .htaccess files
    return 1;  # return true to process
}

sub check_dir {
    my $dir = shift;
    return ! m[^\.]; # don't process .directories
#    return 1;  # return true to process this directory
}

#-------------------- End User Config ------------------------------------
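
A note on the check_dir above, offered as a sketch rather than a
confirmed diagnosis: in Perl, a bare m[^\.] matches against $_, not
against $dir, so the test only works if $_ happens to hold the
directory's own name at the moment check_dir is called. If $_ holds a
full path such as /var/local/testsearch/.hidden, the pattern never
matches and nothing is skipped. The version below tests the argument
explicitly and looks only at the last path component. It assumes
DirTree.pl passes the path as the first argument to both subs (as
"my $path = shift" suggests); is_hidden is a hypothetical helper that
is not part of DirTree.pl.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical helper (not part of DirTree.pl): true when the last
# component of a path starts with a dot, e.g. ".hidden" or ".htaccess".
sub is_hidden {
    my $path = shift;
    my $name = basename($path);

    # "." and ".." are the current/parent directory, not hidden entries.
    return 0 if $name eq '.' or $name eq '..';

    return $name =~ /^\./ ? 1 : 0;
}

# Return true if the file is ok to process, false to skip it.
sub check_path {
    my $path = shift;
    return if is_hidden($path);    # skips .htaccess and any other dotfile
    return 1;
}

# Return true if the directory is ok to descend into, false to skip it.
sub check_dir {
    my $dir = shift;
    return if is_hidden($dir);     # tests $dir explicitly instead of $_
    return 1;
}
```

check_path only looks at the file's own name, so a regular file inside
a hidden directory is skipped only if that directory is rejected by
check_dir first.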


# swish-e -S prog -c /etc/swish-e/swish-e.conf
Indexing Data Source: "External-Program"
Indexing "DirTree.pl"
External Program found: /usr/lib/swish-e/DirTree.pl
Failed to set content type for document '/var/local/testsearch/nov1_agenda'
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 5,242 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
5,242 unique words indexed.
5 properties sorted.
23 files indexed.  211,048 total bytes.  31,437 total words.
Elapsed time: 00:00:03 CPU time: 00:00:01
Indexing done!



-- 
I have no problem not listening to The Temptations.
                                                    -Mitch Hedberg
Received on Fri Dec 14 17:45:44 2007