I've been playing with swish-e 2.4.5 for a couple of days (for
indexing an intranet with around 8000 pages and PDFs) and have been
modifying the spider scripts a bit. Not all mods are suitable for the
rest of the world, but some are and I wonder whether it would be
appreciated if I clean them up for general use.
1. Spider traps
I encountered a couple of cgi/php scripts that generated nearly
infinite numbers of unique URIs. I first tried filtering the URLs with
regexps, but then added a feature: URIs with more than 2
(user-definable) CGI parameters are counted, and after a certain
user-definable number of similar URLs the spider stops fetching them.
Such a URI is still indexed, but counted, and after 10 (user-definable)
times the spider stops following links.
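
A minimal sketch of the counting idea, in Python rather than the Perl of the actual spider scripts. What counts as "similar" is an assumption here: two URIs are treated as similar when they share host, path and the set of parameter names. The names TrapGuard, signature, MAX_PARAMS and MAX_SIMILAR are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

MAX_PARAMS = 2    # URIs with more query parameters than this are counted
MAX_SIMILAR = 10  # fetch at most this many URIs per signature


def signature(uri):
    """Reduce a URI to host, path and sorted parameter names (values dropped)."""
    parts = urlsplit(uri)
    names = {name for name, _ in parse_qsl(parts.query, keep_blank_values=True)}
    return parts.netloc, parts.path, tuple(sorted(names))


class TrapGuard:
    """Hypothetical helper: decides whether the spider should still fetch a URI."""

    def __init__(self, max_params=MAX_PARAMS, max_similar=MAX_SIMILAR):
        self.max_params = max_params
        self.max_similar = max_similar
        self.seen = Counter()

    def should_fetch(self, uri):
        sig = signature(uri)
        if len(sig[2]) <= self.max_params:
            return True  # few parameters: never throttled
        self.seen[sig] += 1
        return self.seen[sig] <= self.max_similar
```

In a spider, a check like should_fetch() would run before a link is queued, so a script that keeps generating new parameter values is cut off after the limit while ordinary pages are unaffected.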
2. Bad LaTeX-generated PDFs
Some LaTeX installations generate PDFs with a nonstandard font
encoding, which pdftotext turns into loads of garbage. I try to catch
them with a rather ad-hoc regexp, which seems to work but is not
really distribution-quality.
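
One way such a check could look, as a rough Python sketch. This is not the regexp used in my spider; it is an assumed heuristic that calls extracted text garbage when fewer than half of its tokens look like words of three or more letters. The function name and the 0.5 threshold are illustrative only.

```python
import re

WORD = re.compile(r"[^\W\d_]{3,}")  # three or more letters (Unicode-aware)
TOKEN = re.compile(r"\S+")          # whitespace-separated tokens
PUNCT = ".,;:!?()\"'"               # stripped from token edges before matching


def looks_like_garbage(text, min_ratio=0.5):
    """Return True when too few tokens of the extracted text look like words."""
    tokens = TOKEN.findall(text)
    if not tokens:
        return False  # empty output is a different problem, not garbage
    wordish = sum(1 for t in tokens if WORD.fullmatch(t.strip(PUNCT)))
    return wordish / len(tokens) < min_ratio
```

A document flagged this way could be skipped or indexed by title and URL only; the threshold would need tuning against real pdftotext output.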
3. One of our intranet servers delivers everything, including PDFs, as
content-type text/something. I'm filtering that as well. This is also
questionable code for general use.
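
A small Python sketch of one way to detect the mislabelled PDFs, assuming the spider has the raw response body available. It relies only on the fact that a PDF file starts with the bytes "%PDF-"; the function name is hypothetical and this is not the code in my spider.

```python
def effective_content_type(declared, body):
    """Return 'application/pdf' when the body is a PDF despite a text/* header.

    declared: value of the Content-Type header, e.g. 'text/plain; charset=...'
    body:     raw response body as bytes
    """
    main_type = declared.split(";", 1)[0].strip().lower()
    if main_type.startswith("text/") and body.startswith(b"%PDF-"):
        return "application/pdf"
    return main_type
```

The corrected type could then be used to pick the right filter, so the document goes through pdftotext instead of being indexed as text.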
4. If I enable a metagroup 'all' in swish.cgi in order to search for
keywords that are either in the title/body or in the URL, it doesn't
work as expected. The reason is that a query "a b" is expanded to
something like
swishdefault=(a b) OR swishtitle=(a b) OR swishdocpath=(a b)
but it won't find anything (presumably because each clause requires
both words in the same field). I replaced it with
swishdefault=(a OR b) OR swishtitle=(a OR b) OR swishdocpath=(a OR b)
but the ranking algorithm doesn't seem to give a bonus to documents
that contain both a and b somewhere. To really fix this, the indexer
would need to be able to create a metaname database column for words
that occur in either swishdefault or swishdocpath. However, I couldn't
find any suitable configuration options and I'm not sure I'm willing
to invest the time to figure out how to modify the source code myself.
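
To make the two expansions concrete, here is a Python sketch that builds both query strings from a word list. It only does string construction and is not the code in swish.cgi; expand_query is a made-up name.

```python
def expand_query(words, metanames, operator=None):
    """Build 'meta1=(w1 w2) OR meta2=(w1 w2) ...' for a metagroup search.

    operator=None joins the words with a space (the original expansion);
    operator='OR' gives the workaround described above.
    """
    joiner = " %s " % operator if operator else " "
    group = "(" + joiner.join(words) + ")"
    return " OR ".join("%s=%s" % (name, group) for name in metanames)


metas = ["swishdefault", "swishtitle", "swishdocpath"]
print(expand_query(["a", "b"], metas))        # original expansion
print(expand_query(["a", "b"], metas, "OR"))  # workaround
```

The workaround matches more documents, but since every word is optional in every field, nothing in the query itself expresses "a and b, each in any field".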
5. I added the minus sign "-" as an alias for the NOT operator in CGI
queries, so that people used to Google don't have to remember a
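
A minimal Python sketch of such an alias, assuming the rewrite happens on the raw query string before it is passed to the search. It is not the change I made to the CGI script; minus_to_not is a hypothetical name, and quoted phrases are not handled.

```python
def minus_to_not(query):
    """Rewrite a leading '-' on a search term into the NOT operator."""
    out = []
    for token in query.split():
        if len(token) > 1 and token.startswith("-"):
            out.extend(["NOT", token[1:]])  # '-foo' becomes 'NOT foo'
        else:
            out.append(token)               # hyphens inside words are kept
    return " ".join(out)
```

Only a minus at the start of a term is rewritten, so hyphenated words and a lone "-" pass through unchanged.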
Users mailing list
Received on Wed Aug 8 10:12:09 2007