Re: Index specific list of files

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Jan 11 2002 - 05:09:14 GMT
At 08:56 PM 01/10/02 -0800, Frank Heasley wrote:
>>>* they are named, in sequence, 10000.htm to 19999.htm
>>>* I want to index an arbitrary subset of 1000 of those files
>>>
>>>A regex won't match.
>><<<<
>>Well, I wonder if you could /1[0-9][0-9][0-9][0-9]\.htm/
>
>you could, but that would match all of your files.

Sorry, I was doing too many things at once.  I thought you wanted to match
that range.


>>   my $num = $1 if /(\d+)\.html/;
>>   return unless $num && $num > 1235 && $num < 8712;
>
>Umm... I think you're trying to produce a list of files here, which is not 
>our problem... we already know what the list is - it's a pre-assigned set 
>of files.

Again, I was thinking of a range of numbers.

If you already know the list of files, name them with IndexDir in the
config.  I've done it lots of times when trying to narrow down a bug to a
single source file.  I have listed thousands of files, and then kept cutting
the list in half until I found the problem.
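
As a rough illustration of the IndexDir approach, here is a minimal sketch
that turns a known list of file numbers into IndexDir lines you could paste
into the config.  The numbers, the /path/to/docs directory and the
index_dir_lines helper are all made up for the example; only the IndexDir
directive and the NNNNN.htm naming come from this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical: a few of the pre-assigned file numbers (the real list
# would hold all 1000 of them).
my @wanted = (10000, 10017, 12345, 19999);

# Assumed location of the documents; adjust to your own tree.
my $dir = '/path/to/docs';

# Return one "IndexDir <file>" config line per file number.
sub index_dir_lines {
    my ($dir, @numbers) = @_;
    return map { "IndexDir $dir/$_.htm\n" } @numbers;
}

print index_dir_lines($dir, @wanted);
```

To narrow down a problem file, you could comment out half of the generated
lines, re-index, and repeat on whichever half still shows the problem.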

If your list of files is generated, then the -S prog approach will be good,
since you can put in code that decides on the fly which files to index.

Sorry for not reading your question more carefully!

>I assume there's some mechanism using swishspider that retrieves and 
>indexes web files one by one.  Would that, perhaps, be an approach?

Well, not swishspider.  swishspider only fetches documents; it's not really
a spider (unlike spider.pl, used with -S prog, which is a real spider).

If you mean -S prog, sure.  That's exactly how it works.  You write a
program to fetch the files (records, URLs, whatever) and feed them to swish.
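
To make that concrete, here is a minimal sketch of such a program.  It
assumes the usual -S prog record format (a Path-Name header, a
Content-Length header giving the body size in bytes, an optional Last-Mtime
header, a blank line, then the document itself); check the docs for your
swish version before relying on it.  The file numbers, the /path/to/docs
directory and the format_record helper are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical: the pre-assigned file numbers to index.
my @wanted = (10000, 10017, 12345, 19999);

# Assumed location of the documents.
my $dir = '/path/to/docs';

# Build one record: headers, a blank line, then the document body.
# Content-Length must be the exact byte count of the body.
sub format_record {
    my ($path, $content, $mtime) = @_;
    return "Path-Name: $path\n"
         . "Content-Length: " . length($content) . "\n"
         . "Last-Mtime: $mtime\n"
         . "\n"
         . $content;
}

binmode STDOUT;    # keep byte counts honest on every platform

for my $num (@wanted) {
    my $path = "$dir/$num.htm";

    my $fh;
    unless (open $fh, '<', $path) {
        warn "skipping $path: $!\n";
        next;
    }
    binmode $fh;

    my $mtime   = (stat $fh)[9];          # last modification time
    my $content = do { local $/; <$fh> }; # slurp the whole file
    close $fh;

    print format_record($path, $content, $mtime);
}
```

If this were saved as, say, feed.pl (a made-up name), the config would name
the script in IndexDir and swish would be run with -S prog, something along
the lines of "swish-e -c swish.conf -S prog".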


-- 
Bill Moseley
mailto:moseley@hank.org
Received on Fri Jan 11 05:10:36 2002