Re: Unique Indexes

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Fri May 05 2006 - 18:19:55 GMT
Sorry for the delay.

Patrick O'Lone scribbled on 4/27/06 10:26 AM:
> I use the file system mode rather than spidering. The problem is with 
> multiple indexes. I index each day like:
> 
> 2006/04/01/articles/
> 2006/04/02/articles/
> .
> .
> etc.
> 
> For each day, there is a corresponding index, like:
> 
> 20060401.swish-e
> 20060402.swish-e
> 
> I then use a search against *.swish-e. Duplication occurs when an 
> article exists for more than one day - thus I use a Berkeley DB file for
> keeping track of checksums between days.

Since you're dealing with multiple indexes, and the same content shows up 
under different URLs across those indexes, there's really no way for 
swish-e to know that the files are identical in content. Swish-e relies on 
the URL to tell documents apart.

You might try storing the checksum as a property on each document (using 
DirTree.pl or a similar prefilter). You could then sort by rank, date and 
checksum, which should always put the latest version of any identical 
articles first, and use the API to skip duplicates based on that checksum. 
It could get tricky when paging results, though.

That's still extra overhead, but it would eliminate the dependency on 
the extra db.
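
Something like this rough SWISH::API sketch is what I have in mind -- 
untested, and it assumes you've defined and populated a 'checksum' 
PropertyName at index time via your prefilter:

  #!/usr/bin/perl
  # Query all the daily indexes and show only the first (highest-ranked)
  # hit for each checksum value. 'checksum' is a property you'd have to
  # add yourself when indexing.
  use strict;
  use warnings;
  use SWISH::API;

  my $swish   = SWISH::API->new( join ' ', glob '*.swish-e' );
  my $results = $swish->query('your query here');

  my %seen;    # checksums already printed
  while ( my $r = $results->next_result ) {
      my $sum = $r->property('checksum') || '';
      next if $sum && $seen{$sum}++;    # skip later copies of the article
      printf "%s  %s\n",
          $r->property('swishlastmodified'),
          $r->property('swishdocpath');
  }

You'd still need to handle paging yourself, since the skipped duplicates 
throw off the hit counts.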

An alternate idea would be to use the checksum as the URL (again with a 
filter before indexing) and then merge indexes periodically (like once a 
week or more), which would eliminate duplicates, since Swish-e will keep 
the newest version of each URL on a merge.
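
For the checksum-as-URL route, the filter could be a small -S prog script 
that feeds swish-e each article with its MD5 as the path. Just a sketch, 
untested, and the script name is made up:

  #!/usr/bin/perl
  # md5path.pl -- hypothetical -S prog prefilter: emit each file with its
  # MD5 checksum as the Path-Name, so merged indexes collapse duplicate
  # articles onto a single "URL".
  use strict;
  use warnings;
  use Digest::MD5 qw(md5_hex);
  use File::Find;

  find( sub {
      return unless -f $_;
      open my $fh, '<', $_ or return;
      my $content = do { local $/; <$fh> };
      print "Path-Name: ", md5_hex($content), "\n";
      print "Content-Length: ", length($content), "\n";
      print "Last-Mtime: ", ( stat $_ )[9], "\n\n";
      print $content;
  }, @ARGV );

You'd run that per day with -S prog (passing the day's directory through 
SwishProgParameters or similar), then once a week do something like

  swish-e -M 20060401.swish-e 20060402.swish-e ... merged.swish-e

and search the merged index instead of the daily ones.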

Maybe those two techniques together can give you some ideas.

As for a 2.4.4 release, that's really up to Bill as the project leader.

cheers,
pek


>> are the URLs you are passing to swish-e unique?
>>
>> Patrick O'Lone scribbled on 4/26/06 8:54 AM:
>>> Hello,
>>>
>>> I've been using swish-e for some time now. I think it's a great
>>> product, but I've had to use a special hack to avoid heavy 
>>> duplication issues within the index. I use MD5 checksums in an 
>>> external Berkeley DB file for maintaining uniqueness within a
>>> collection of documents - I was wondering if there is a better way. 
>>> Is it possible to have a unique key in a swish-e index file or would 
>>> that require the incremental mode feature? Also, will version 2.4.4 
>>> be coming out soon or is it on hold indefinitely? Thanks for any 
>>> feedback!
>>>
> 
> 

-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Fri May 5 11:20:01 2006