Re: Incremental Mode

From: Patrick O'Lone <polone(at)not-real.townnews.com>
Date: Mon Feb 23 2004 - 14:52:58 GMT
Ironically, that's what I'm doing already. :-) I have an external
application that ensures only unique documents are indexed. It keeps a
Berkeley DB on disk keyed by MD5 checksums, each pointing to the absolute
path of the indexed document. It works fairly well. My only complaint is
that I have to maintain the external database and the SWISH-E index in
parallel. However, this approach solves a problem that would arise if the
checksums were stored inside the SWISH-E index: I can keep documents
unique across multiple indexes.
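
For anyone who wants to try the same idea, here is a minimal sketch of that kind of checksum filter. It is not my actual application: it assumes Python's standard dbm module as a stand-in for Berkeley DB, and the function names (md5_of_file, select_unindexed) are made up for illustration.

```python
import dbm
import hashlib
import os


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def select_unindexed(paths, db_path):
    """Return the paths whose content has not been seen before.

    The on-disk database maps MD5 checksum -> absolute path of the
    first document seen with that content. dbm is an assumption here,
    standing in for Berkeley DB.
    """
    to_index = []
    with dbm.open(db_path, "c") as seen:
        for path in paths:
            key = md5_of_file(path).encode("ascii")
            if key in seen:
                # Same content was already recorded: skip the duplicate.
                continue
            seen[key] = os.path.abspath(path).encode("utf-8")
            to_index.append(path)
    return to_index
```

Because the checksum database lives outside any one index, the same db_path can be shared by several indexing runs, which is what keeps documents unique across multiple indexes. The returned list is what would then be handed to the indexer.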

If you have questions, comments, or suggestions about the aforementioned
message, you can respond by replying to this message or contacting us at
(309)-743-0800. Thank you.

Regards,

Patrick O'Lone
Software Project Manager
TownNews.com

(309)-743-0809
polone@townnews.com

> -----Original Message-----
> From: swish-e@sunsite.berkeley.edu 
> [mailto:swish-e@sunsite.berkeley.edu] On Behalf Of 
> redna@euskalerria.org
> Sent: Monday, February 23, 2004 2:33 AM
> To: Multiple recipients of list
> Subject: [SWISH-E] Re: Incremental Mode
> 
> 
> >On Fri, Feb 20, 2004 at 07:01:06AM -0800, Ander wrote:
> >> I've been using incremental indexing mode, and I haven't had any
> >> problems. It works fine on my machines.
> >>
> >> It relies on the file modification date for local files, but I'm
> >> not sure it recognises modified "remote" files (web files).
> >
> >No, that's not exactly correct.
> >
> >-u only adds files to an existing index.  I suspect if you did:
> >
> >  swish-e -i foo.html     # creates new index
> >  swish-e -u -i foo.html  # add foo.html to existing index
> >
> >that foo.html would be in the index twice.
> 
> We could capture the content or calculate an MD5 checksum to
> determine which files have been "modified", couldn't we?
>
> We can create a Berkeley DB hash to store the checksums of the
> dynamic web pages we want to index, and use the filter_content
> feature of spider.pl to decide which files to index.
> 
> What do you think?
> 
> >To use incremental indexing you have to build swish-e differently
> >(with --enable-incremental option).  This uses a different index
> >format (Btree instead of a hash-based index) and is not compatible
> >with other (non-Btree) indexes.
> >
> >There's currently no way to update an index (i.e. say if an existing 
> >file is updated).  This type of incremental indexing might be useful 
> >for something like a mailing list where the index just gets added to 
> >(old messages don't change).
> >
> >You can use other methods (like -D) to only pass to swish-e the new 
> >files to be added to an existing index.
> >
> >--
> >Bill Moseley
> >moseley@hank.org
> >
> >
> _________________________________________________________
> Txat euskalduna >>> http://www.euskalerria.org/solasgunea
> 
Received on Mon Feb 23 06:53:01 2004