See below for comments / responses
At 08:28 PM 1/10/02 -0800, you wrote:
>Hi Frank,
>
>I'm going to cc: the list, as this might come up again.
>
>At 07:53 PM 01/10/02 -0800, Frank Heasley wrote:
> >>>>
>>Hi Bill
>>
>>Thanks for the response.
>>
>>Perhaps this will serve to illustrate:
>>
>>* I have a directory with 10,000 files.
><<<<
>On Windows? Most OS will degrade in performance when you exceed some
>number of files per directory. It's a good idea to spread them around.
> >>>>
Regardless - but let's say it's Linux, just for the heck of it.
>>* they are named, in sequence, 10000.htm to 19999.htm
>>* I want to index an arbitrary subset of 1000 of those files
>>
>>A regex won't match.
><<<<
>Well, I wonder if you could /1[0-9][0-9][0-9][0-9]\.htm/
You could, but that would match all of your files.
Remember, these files are NOT in sequence. They are an arbitrary set,
e.g.:
12345.htm
43213.htm
93848.htm
73472.htm
87364.htm
etc. for 10,000 directory entries.
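
To make the point concrete, here is a minimal sketch (Python, purely illustrative). It assumes the directory holds 10000.htm through 19999.htm as first described, and the three "wanted" names are made up. The pattern suggested above selects every entry, while a plain set lookup selects only the pre-assigned files:

```python
import re

# The pattern suggested above, anchored so it must match the whole name.
pattern = re.compile(r"^1[0-9]{4}\.htm$")

# Assumed directory listing: 10000.htm .. 19999.htm (10,000 entries).
directory = [f"{n}.htm" for n in range(10000, 20000)]

# Hypothetical pre-assigned subset; the real list would hold 1000 names.
wanted = {"12345.htm", "13848.htm", "17364.htm"}

# The regex cannot tell wanted from unwanted files: it matches all of them.
matched = [name for name in directory if pattern.match(name)]

# Membership in an explicit list picks out exactly the wanted files.
selected = [name for name in directory if name in wanted]

print(len(matched), len(selected))
```

So whatever does the selecting has to consult the list itself; no single pattern can stand in for it.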
> >>>>
>>So, I would need to extend swish.conf by 1000 lines, right? One line for
>>each FileMatch file?
><<<<
>That would be one way, too. Or 100 x ten lines.
> >>>>
>>Would this cause any problems?
>>
>>Suppose I had 100,000 files and wanted to index 10,000 of them?
><<<<
>What *I* would do is use -S prog. I'd use the DirTree.pl program, and
>then do something like:
><<<<
>
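> # Capture the number in list context; "my ... if ..." is unreliable in Perl.
> my ($num) = /(\d+)\.html?/;    # matches both .htm and .html
> # Skip any file whose number falls outside the wanted range.
> return unless $num && $num > 1235 && $num < 8712;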
Umm... I think you're trying to produce a list of files here, which is not
our problem... we already know what the list is - it's a pre-assigned set
of files.
How would you feed that list to Swish?
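
For what it's worth, here is a rough sketch of how a list might be fed in, assuming I read the -S prog docs correctly: swish runs an external program and reads documents from its standard output, each one preceded by Path-Name and Content-Length headers (Last-Mtime being optional) and a blank line. The script name feedlist.py, the list file, and the function name are all hypothetical, and I have not run this against swish itself:

```python
#!/usr/bin/env python3
"""Hypothetical feeder for `swish-e -S prog`: emits only the listed files."""
import os
import sys


def emit_documents(list_path, base_dir, out):
    """Write each file named in list_path to `out` as headers, blank line, body.

    Returns the number of documents written. Unreadable files are skipped.
    """
    with open(list_path, encoding="utf-8") as listing:
        names = [line.strip() for line in listing if line.strip()]

    count = 0
    for name in names:
        path = os.path.join(base_dir, name)
        try:
            with open(path, "rb") as handle:
                body = handle.read()
            mtime = int(os.path.getmtime(path))
        except OSError as err:
            # Report on stderr so stdout carries nothing but documents.
            print(f"skipping {path}: {err}", file=sys.stderr)
            continue

        # Header layout assumed from the -S prog documentation.
        header = (
            f"Path-Name: {path}\n"
            f"Content-Length: {len(body)}\n"
            f"Last-Mtime: {mtime}\n"
            "\n"
        )
        out.write(header.encode("utf-8"))
        out.write(body)
        count += 1
    return count


# Run only when called as: feedlist.py LIST_FILE BASE_DIR
if __name__ == "__main__" and len(sys.argv) == 3:
    emit_documents(sys.argv[1], sys.argv[2], sys.stdout.buffer)
```

If that is right, swish.conf would name the script in IndexDir and pass the two arguments via SwishProgParameters, and indexing would be run with -S prog. Content-Length is counted in bytes, which is why the files are read in binary mode.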
>>Also, this applies to v 2.2 only, not previous versions, right?
><<<<
>FileMatch? Yes it was added in 2.1-dev-<something>
> >>>>
I assume there's some mechanism using swishspider that retrieves and
indexes web files one by one. Would that, perhaps, be an approach?
>>Frank
>>
>>At 07:31 PM 1/10/02 -0800, you wrote:
>>>At 05:33 PM 01/10/02 -0800, Frank Heasley wrote:
>>> >FileRules allow you to specify files that do not get indexed.
>>> >
>>> >However, is there any method to give swish a list of files such that only
>>> >those files DO get indexed and all the rest of the files in the directory,
>>> >which may be quite numerous and have very similar names, are not?
>>>
>>>I asked that same question not too long ago. Hence:
>>>
>>>
>>>http://swish-e.org/2.2/docs/SWISH-CONFIG.html#item_FileMatch
>>>
>>>--
>>>Bill Moseley
>>>mailto:moseley@hank.org
>>
><<<<
>
>
>
>--
>Bill Moseley
>mailto:moseley@hank.org
Received on Fri Jan 11 04:56:44 2002