Sorry for the delay.
Patrick O'Lone scribbled on 4/27/06 10:26 AM:
> I use the file system mode rather than spidering. The problem is with
> multiple indexes. I index each day like:
> For each day, there is a corresponding index, like:
> I then use a search against *.swish-e. Duplication occurs when an
> article exists for more than one day - thus I use a Berkeley DB file for
> keeping track of checksums between days.
Since you're dealing with multiple indexes, and each index has different
URLs for identical content, there's really no way for swish-e to know
that the files are identical in content. Swish-e relies on URL to
You might try storing the checksum as a property in each file (using
DirTree.pl or similar to prefilter the content). Then you could sort by
rank, date and checksum, which should always give you the latest version
of any identical articles first. Then you could use the API to filter
out duplicate files based on checksum. Could get tricky when paging
That's still extra overhead, but it would eliminate the dependency on
the extra db.
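
A rough sketch of the filtering step, in Python for illustration only. It
assumes each hit has already been pulled out of the search results into a
plain dict with "path", "rank", "date" and "checksum" keys; those key names
and both helper functions are hypothetical, not part of any swish-e API.
Hits are ordered by rank and then date, newest first, and only the first
hit for each checksum is kept.

```python
def sort_hits(hits):
    # Highest rank first; within equal rank, newest date first.
    # "rank" and "date" are assumed to be plain comparable values.
    return sorted(hits, key=lambda h: (h["rank"], h["date"]), reverse=True)


def dedupe_by_checksum(hits):
    # Keep only the first hit seen for each checksum, preserving order.
    seen = set()
    unique = []
    for hit in hits:
        if hit["checksum"] in seen:
            continue
        seen.add(hit["checksum"])
        unique.append(hit)
    return unique


if __name__ == "__main__":
    hits = [
        {"path": "2006-04-25/a.html", "rank": 900, "date": 20060425, "checksum": "abc"},
        {"path": "2006-04-26/a.html", "rank": 900, "date": 20060426, "checksum": "abc"},
        {"path": "2006-04-26/b.html", "rank": 500, "date": 20060426, "checksum": "def"},
    ]
    for hit in dedupe_by_checksum(sort_hits(hits)):
        print(hit["path"])
```

Note that the filter runs over the whole sorted result list, so any paging
would have to be applied after it, not before.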
An alternate idea would be to use the checksum as the URL (again with a
filter before indexing) and then merge indexes periodically (like once a
week or more), which would eliminate duplicates, since Swish-e will keep
the newest version of each URL on a merge.
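
Something like the following could serve as the pre-indexing filter, again
as a Python sketch under assumptions. It emits each file with its MD5
checksum in place of the path, using the Path-Name / Content-Length /
Last-Mtime header layout as I recall it from the "-S prog" input format;
check the header names against the swish-e docs before relying on it. The
prog_record function is a made-up name.

```python
import hashlib
import os
import sys


def prog_record(content: bytes, mtime: int) -> bytes:
    # Use the MD5 of the content as the document "URL", so identical
    # articles from different days end up with the same Path-Name.
    checksum = hashlib.md5(content).hexdigest()
    header = (
        f"Path-Name: {checksum}\n"
        f"Content-Length: {len(content)}\n"
        f"Last-Mtime: {mtime}\n"
        "\n"
    ).encode("ascii")
    return header + content


if __name__ == "__main__":
    # Each command-line argument is a file to emit on stdout.
    for path in sys.argv[1:]:
        with open(path, "rb") as fh:
            data = fh.read()
        mtime = int(os.path.getmtime(path))
        sys.stdout.buffer.write(prog_record(data, mtime))
```

The obvious cost is that the real file path no longer appears as the URL,
so it would need to be carried some other way if you want to link to it.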
Maybe those two techniques together can give you some ideas.
As for a 2.4.4 release, that's really up to Bill as the project leader.
>> are the URLs you are passing to swish-e unique?
>> Patrick O'Lone scribbled on 4/26/06 8:54 AM:
>>> I've been using swish-e for some time now. I think it's a great
>>> product, but I've had to use a special hack to avoid heavy
>>> duplication issues within the index. I use MD5 checksums in an
>>> external Berkeley DB file for maintaining uniqueness within a
>>> collection of documents - I was wondering if there is a better way.
>>> Is it possible to have a unique key in a swish-e index file or would
>>> that require the incremental mode feature? Also, will version 2.4.4
>>> be coming out soon or is it on hold indefinitely? Thanks for any
Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
Received on Fri May 5 11:20:01 2006