At 04:26 PM 04/22/02 -0700, Colin Kuskie wrote:
>So exactly at what point do the ReplaceRules take place? If they
>were implemented before swish-e invoked the swishspider, then the
>system should work as described.
It does work as described; I'm just not sure it's clearly described.
Swish is basically:
    input method (fs, http, prog) -> parser (xml, txt, html) -> indexer
And ReplaceRules runs in the indexer, so it works with any input method.
The parser says, "Hey, index this data and, by the way, its file name is
this", and the indexer says, "ok, but first I'll run ReplaceRules to modify
the file name before storing it in the index."
The above isn't entirely accurate: the HTML|XML|TXT parsers get the entire
file in a buffer, whereas the HTML2|XML2|TXT2 parsers get passed a file
handle. The parsers send words to the indexer in chunks, not all at once.
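To make the flow above concrete, here is a toy model in Python. It is only a sketch of the idea, not swish-e's actual code: every function name, the URLs, and the simple (old, new) string-replacement form of the rules are made up for illustration. The point is that the file name gets rewritten inside the indexer, whatever input method supplied the document.

```python
import re

# Hypothetical stand-ins for the three stages; none of these names exist in swish-e.

def input_method():
    """Yield (file name, raw content) pairs, as fs, http, or prog would."""
    yield "http://localhost/docs/a.html", "<p>hello world</p>"
    yield "http://localhost/docs/b.html", "<p>swish indexing</p>"

def parser(content, chunk_size=1):
    """Strip tags and yield the words in chunks, not all at once."""
    words = re.sub(r"<[^>]+>", " ", content).split()
    for i in range(0, len(words), chunk_size):
        yield words[i:i + chunk_size]

def replace_rules(name, rules):
    """Apply simple (old, new) string replacements to a file name."""
    for old, new in rules:
        name = name.replace(old, new)
    return name

def indexer(documents, rules):
    """Rewrite each file name first, then store the parsed words under it."""
    index = {}
    for name, content in documents:
        stored_name = replace_rules(name, rules)
        for chunk in parser(content):
            for word in chunk:
                index.setdefault(word, []).append(stored_name)
    return index

rules = [("http://localhost/", "http://www.example.com/")]
index = indexer(input_method(), rules)
```

Swapping input_method() for any other source of (name, content) pairs leaves the rewriting untouched, which is why the rules are independent of the input method.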
It's up to the input method to pass to swish the files to index, and to
make sure duplicates are not indexed. With -S prog you can index every
document using the same file name, if you like, and swish won't care.
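As a rough illustration of that last point, here is a sketch of what a -S prog program writes for each document: a Path-Name header, a Content-Length header giving the body size in bytes, a blank line, then the body. The helper names, the file name, and the sample documents are invented for the example; it deliberately emits two documents under one name.

```python
def prog_record(path_name, body):
    """Build one document record: headers, a blank line, then the body."""
    data = body.encode("utf-8")
    # Content-Length counts bytes, not characters.
    header = f"Path-Name: {path_name}\nContent-Length: {len(data)}\n\n"
    return header.encode("utf-8") + data

def write_records(out):
    """Write two different documents under the same (made-up) file name."""
    bodies = (
        "<html><body>first</body></html>",
        "<html><body>second</body></html>",
    )
    for body in bodies:
        out.write(prog_record("same-name.html", body))
```

Calling write_records(sys.stdout.buffer) from the program that swish-e runs would hand it both documents under one name, so avoiding duplicates is the program's job.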
>I'll look at spider.pl, and I'll try to use Randal's pslinky program
>to do the downloading for me, just to kick it up another notch.
Or better, if you think spidering in parallel will make a difference,
patch spider.pl. It's already designed with swish-e in mind.
Received on Mon Apr 22 23:51:51 2002