Re: Index specific list of files

From: Frank Heasley <DrHeasley(at)not-real.chemistry.com>
Date: Fri Jan 11 2002 - 04:56:42 GMT
See below for comments / responses

At 08:28 PM 1/10/02 -0800, you wrote:
>Hi Frank,
>
>I'm going to cc: the list, as this might come up again.
>
>At 07:53 PM 01/10/02 -0800, Frank Heasley wrote:
> >>>>
>>Hi Bill
>>
>>Thanks for the response.
>>
>>Perhaps this will  serve to illustrate:
>>
>>* I have a directory with 10,000 files.
><<<<
>On Windows?  Most OS will degrade in performance when you exceed some 
>number of files per directory.  It's a good idea to spread them around.
> >>>>

regardless - but let's say it's Linux, just for the heck of it.


>>* they are named, in sequence, 10000.htm to 19999.htm
>>* I want to index an arbitrary subset of 1000 of those files
>>
>>A regex won't match.
><<<<
>Well, I wonder if you could /1[0-9][0-9][0-9][0-9]\.htm/

you could, but that would match all of your files.

remember, these files are NOT in sequence.  They are an arbitrary set.

eg:
12345.htm
43213.htm
93848.htm
73472.htm
87364.htm
etc. for 10,000 directory entries.
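
To make the point concrete, here is a small Python sketch (not Swish-e itself). It assumes the file names run from 10000.htm to 19999.htm as described earlier, and the three names in `wanted` are made up for illustration. The suggested pattern accepts every name in the directory, while a lookup against an explicit set picks out only the files on the list.

```python
import re

# Every file in the directory is named 10000.htm .. 19999.htm.
all_files = [f"{n}.htm" for n in range(10000, 20000)]

# The pattern suggested above, applied to the whole file name.
pattern = re.compile(r"1[0-9][0-9][0-9][0-9]\.htm")
regex_hits = [name for name in all_files if pattern.fullmatch(name)]

# A hypothetical pre-assigned subset (the real list would hold 1000 names).
wanted = {"12345.htm", "13213.htm", "17472.htm"}
set_hits = [name for name in all_files if name in wanted]

print(len(regex_hits))  # 10000: the regex cannot tell the files apart
print(len(set_hits))    # 3: only the listed files are selected
```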

> >>>>
>>So, I would need to extend swish.conf by 1000 lines, right?  One line for 
>>each Filematch file?
><<<<
>That would be one way, too.  Or 100 x ten lines.
> >>>>
>>Would this cause any problems?
>>
>>Suppose I had 100,000  files and wanted to index 10,000 of them?
><<<<
>What *I* would do is use -S prog.  I'd use the DirTree.pl program, and 
>then do something like:
><<<<
>
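>   # capture the numeric part of the file name (.htm or .html)
>   my ($num) = /(\d+)\.html?$/;
>   # skip anything outside the wanted numeric range
>   return unless $num && $num > 1235 && $num < 8712;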

Umm... I think you're trying to produce a list of files here, which is not 
our problem... we already know what the list is - it's a pre-assigned set 
of files.

How would you feed that list to Swish?
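
One possible answer, sketched under assumptions: with -S prog, Swish-e reads documents from the standard output of an external program, each one preceded by a few header lines (Path-Name, Content-Length, and optionally Last-Mtime) and a blank line. Such a program could simply read the pre-assigned list from a text file, one path per line, and emit only those files. The function name `emit_documents` and the list-file layout are hypothetical, and the header details should be checked against the documentation for the installed version.

```python
import os
import sys


def emit_documents(list_path, out):
    """Write each file named in list_path to out, one document at a time.

    list_path: text file with one document path per line (assumed layout).
    out: a binary stream, e.g. sys.stdout.buffer.
    Returns the number of documents written.
    """
    count = 0
    with open(list_path) as listing:
        for line in listing:
            path = line.strip()
            if not path:
                continue  # ignore blank lines in the list
            with open(path, "rb") as doc:
                body = doc.read()
            mtime = int(os.path.getmtime(path))
            # Headers, then a blank line, then exactly Content-Length bytes.
            header = (
                f"Path-Name: {path}\n"
                f"Content-Length: {len(body)}\n"
                f"Last-Mtime: {mtime}\n"
                "\n"
            )
            out.write(header.encode())
            out.write(body)
            count += 1
    return count


if __name__ == "__main__" and len(sys.argv) == 2:
    # The list file is passed as the only argument.
    emit_documents(sys.argv[1], sys.stdout.buffer)
```

The script would then be named as the program to run under -S prog, with the list file handed to it as a parameter (the SwishProgParameters directive appears to be meant for this). The index contains only the listed files, however many other files share the directory.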


>>Also, this applies to v 2.2 only, not previous versions, right?
><<<<
>FileMatch?  Yes it was added in 2.1-dev-<something>
> >>>>

I assume there's some mechanism using swishspider that retrieves and 
indexes web files one by one.  Would that, perhaps, be an approach?



>>Frank
>>
>>At 07:31 PM 1/10/02 -0800, you wrote:
>>>At 05:33 PM 01/10/02 -0800, Frank Heasley wrote:
>>> >FileRules allow you to specify files that do not get indexed.
>>> >
>>> >However, is there any method to give swish a list of files such that only
>>> >those files DO get indexed and all the rest of the files in the directory,
>>> >which may be quite numerous and have very similar names, are not?
>>>
>>>I asked that same question not too long ago.  Hence:
>>>
>>> 
>>>http://swish-e.org/2.2/docs/SWISH-CONFIG.html#item_FileMatch
>>>
>>>
>>>
>>>
>>>
>>>--
>>>Bill Moseley
>>>mailto:moseley@hank.org
>>
><<<<
>
>
>
>--
>Bill Moseley
>mailto:moseley@hank.org




Received on Fri Jan 11 04:56:44 2002