On Fri, Dec 02, 2005 at 07:56:33AM -0800, David Larkin wrote:
> Then I read a paragraph which I simply don't understand.
>
> But, Swish-e will not use SWISH::Filter by default when using the
> file system method of indexing. To use SWISH::Filter when indexing
> by file system method (-S fs), you can use a FileFilter directive
> with the swish_filter.pl filter (which is just a program that uses
> SWISH::Filter) or use the -S prog method of indexing and use the
> DirTree.pl program for fetching documents.
I'll get to that in a second.
Think of swish in small functional units.
Swish basically parses html, xml, or text and creates an index.
How it gets documents varies and that's a separate feature.
Ok, so first there's the default -S fs -- that uses a built-in bit of
code to walk the file system and read files. That's really all it
knows how to do. But what do you do when you have non-text/xml/html
docs?
Then "FileFilter" was added as a way for *swish* to pass a document to
a program and read back the program's output. You needed to define a
filter for each type of file (based on file extension). That's not
great for a number of reasons (what does file extension have to do
with anything???) and you have to be specific about what programs
filter what.
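To make the FileFilter idea concrete, here is a rough sketch in Python (not swish's actual C code) of what dispatching on file extension amounts to. The function name, the table layout, and treating "%p" as a stand-alone list element are all made up for illustration; only the pdftotext entry mirrors a real FileFilter setup.

```python
import os
import subprocess

# Illustrative table: file extension -> command template. "%p" marks where
# the document path goes, loosely modelled on FileFilter's %p substitution.
FILE_FILTERS = {
    ".pdf": ["pdftotext", "%p", "-"],  # "-" makes pdftotext write to stdout
}


def filter_by_extension(path, filters=FILE_FILTERS):
    """Run the filter registered for the file's extension and return its
    stdout as bytes, or None when no filter is registered."""
    ext = os.path.splitext(path)[1].lower()
    template = filters.get(ext)
    if template is None:
        return None
    argv = [path if arg == "%p" else arg for arg in template]
    # check=True raises CalledProcessError if the filter program fails.
    return subprocess.run(argv, capture_output=True, check=True).stdout
```

Note that every new document type needs its own entry in the table, which is exactly the maintenance burden described above.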
[I'm leaving out the -S http method because it stinks]
Then swish added the -S prog method which allowed swish to read input
from STDIN if the input was formatted correctly (a header before each
document). That meant you could do something like:
some_program | swish-e -S prog -i stdin
All some_program has to do is output text, xml, or html, and a header
for each file saying the file name, file length, last modified, etc.
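As a rough sketch of what "some_program" could look like, here it is in Python. The header names (Path-Name, Content-Length, Last-Mtime) are the ones I'd expect from the -S prog documentation, so check them against the docs for your version; the function names and the list of extensions are just assumptions for the example.

```python
import os
import sys


def format_document(path_name, content, mtime):
    """Return one document (bytes) as headers, a blank line, then the body."""
    headers = (
        f"Path-Name: {path_name}\n"
        f"Content-Length: {len(content)}\n"  # length of the body in bytes
        f"Last-Mtime: {mtime}\n"             # seconds since the epoch
        "\n"                                 # blank line ends the headers
    )
    return headers.encode("utf-8") + content


def emit_tree(root, out):
    """Write every text/xml/html file under root to out; return the count."""
    count = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            if not name.endswith((".html", ".htm", ".xml", ".txt")):
                continue  # anything else would need filtering first
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                content = fh.read()
            mtime = int(os.path.getmtime(path))
            out.write(format_document(path, content, mtime))
            count += 1
    return count


# Only runs when a directory is given on the command line.
if __name__ == "__main__" and len(sys.argv) == 2:
    emit_tree(sys.argv[1], sys.stdout.buffer)
```

Saved under a hypothetical name like feed.py, it would take the place of some_program in the pipeline above.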
Now it would be nice to have a utility that can take any document and
look at its content-type and then decide how to filter it into one
of the three formats that swish understands. That's what
SWISH::Filter does.
SWISH::Filter is passed a file (in memory or on disk) and it
determines the file's content type and then looks for a filter for
that file. It then returns the filtered file.
You can't really use it this way, but it's basically like:
fetch_files | SWISH::Filter | swish-e -S prog -i stdin
It's really done like this:
DirTree.pl <some params> | swish-e -S prog -i stdin
or
spider.pl <some params> | swish-e -S prog -i stdin
Look at DirTree.pl in your distribution and see how that works.
SWISH::Filter automatically loads filters that are installed.
SWISH::Filter also uses helper programs. For example, it uses "catdoc"
to read MS Word docs.
So a user on a Debian-based system might do this:
spider.pl <some params> | swish-e -S prog -i stdin
and realize that MS Word docs are not being indexed. Then they would
do:
# apt-get install catdoc
and then the Word docs would magically get indexed because catdoc is now
available on the computer. It works because there's already a
SWISH::Filters::Doc2txt.pm module that knows how to use catdoc -- if
catdoc is installed.
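The "magic" is nothing more than checking, at run time, whether the helper program can be found. Here is a conceptual sketch in Python; it is not the SWISH::Filter API, and the table, the function name, and the return values are invented for the example.

```python
import shutil

# Hypothetical registry: content type -> (helper program, output type).
FILTERS = {
    "application/msword": ("catdoc", "text/plain"),
    "application/pdf": ("pdftotext", "text/plain"),
}

# Types that need no filtering because swish can parse them directly.
NATIVE_TYPES = {"text/html", "text/plain", "text/xml", "application/xml"}


def choose_filter(content_type, which=shutil.which):
    """Return (helper, output_type) for a usable filter, ("native", type)
    when no filtering is needed, or ("skip", type) when nothing can
    handle the document."""
    if content_type in NATIVE_TYPES:
        return ("native", content_type)
    entry = FILTERS.get(content_type)
    if entry is None:
        return ("skip", content_type)      # no filter knows this type
    helper, output_type = entry
    if which(helper) is None:
        return ("skip", content_type)      # filter exists, helper missing
    return (helper, output_type)
```

Before catdoc is installed, a Word doc falls into the second "skip" case; once it is installed, the same call returns the catdoc entry without any config change.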
Or, say someone wants to index OpenOffice docs and there isn't an
existing SWISH::Filters:: module to do the work. So they create a
SWISH::Filters::OO2html.pm file (by copying an existing filter) and
then magically OO docs will be indexed with no changes to any configs.
The terminology is poor. SWISH::Filter is a module that loads
SWISH::Filters::* modules. A SWISH::Filters::* module may do all the
work of filtering, or it may use other modules or programs. Like
above, "catdoc" is used to read MS Word docs.
There's a wrapper program for SWISH::Filter called swish-filter-test:
$ swish-filter-test 050819-securing-mac-os-x-tiger.pdf
Document 050819-securing-mac-os-x-tiger.pdf was filtered.
Document: 050819-securing-mac-os-x-tiger.pdf (050819-securing-mac-os-x-tiger.pdf)
Content-Type: text/html
Parser type: HTML*
>Filter used: SWISH::Filters::Pdf2HTML=HASH(0x88b7994) ( application/pdf -> text/html )
Now, back to that paragraph.
The FileFilter allows swish to take a document it's processing and
pass it to an external program. So, there's the "swish_filter.pl"
program that allows you to use SWISH::Filter via a FileFilter
directive. I don't recommend using it that way, but it's possible.
> Can I have a single index for a directory with different filetypes ?
Sure.
> I guess I have to add lines like
>
> FileFilter .pdf pdftotext "'%p' -" IndexContents TXT* .pdf
>
> to the config file
You can still do that, but I think it's better to use SWISH::Filter.
> But then what args do I use with swish-e to create the index ?
>
> Is there an example, or tutorial anywhere ?
Are the docs that hard to follow?
There's this:
http://swish-e.org/docs/install.html#general_configuration_and_usage
which includes three steps to index a site.
Then this follows that:
http://swish-e.org/docs/install.html#spidering_and_searching_with_a_web_form_
which has all the steps for not only indexing but for creating a
search page. If you have catdoc and xpdf installed that will index
Word and PDF docs.
Right after that is:
http://swish-e.org/docs/install.html#indexing_other_types_of_documents_filtering
Which you read. It clearly states:
    "This has resulting in a bit of confusion."
which makes me wonder whether that was on purpose.
Which, I think, explains what I said above.
--
Bill Moseley
moseley@hank.org
Received on Fri Dec 2 09:00:21 2005