At 12:29 AM 1/26/2002 -0800, Cristiano Corsani wrote:
>my problem: I have a db that I index with swish-e using prog
>option. How can I manage these situations?
1) A record has been added;
>2) A record has been modified;
>3) A record has been deleted.
Swish is really good for small to medium size collections of data that
don't change very fast. What counts as "small" or "fast" depends on your needs.
It's number 2 (modified) that's really the hard part.
Incremental indexing is discussed a little in the FAQ. And incremental
indexing is very high on the "to do" list for swish, so there may be hope.
Here's what I have done in the past. It's a lot like making incremental
backups: You create one main index. Then when any record is changed you
set a flag someplace (which could even be just that a file's date has
changed).
Then use cron to run a job every few minutes to look for that flag that
something was modified or created. If the flag is found you run indexing
on only the files/records that have a modified date later than the date the
full index was created.
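
A rough sketch of that selection step in Python, under some assumptions: the
"date the full index was created" is taken from the mtime of a stamp file
that the full indexing job touches (the stamp file and the function name are
made up for illustration, not part of swish). The resulting list is what you
would hand to the incremental indexing run.

```python
import os

def files_newer_than(root, stamp_path):
    """Return files under root modified after stamp_path was last modified."""
    # Assumption: the full indexing job touches stamp_path when it runs.
    cutoff = os.stat(stamp_path).st_mtime
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_mtime > cutoff:
                    changed.append(path)
            except FileNotFoundError:
                # File vanished between listing and stat(); skip it.
                continue
    return sorted(changed)
```

If the list comes back empty the cron job can simply exit without indexing.
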
When searching you specify both indexes at the same time. But before
displaying search results you have to check each result against the source.
Probably the simplest thing is to check the last modified date returned
from swish and see if it matches the date on the real resource, and throw
the result away if it doesn't.
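
Something like the following could do that check. It is only a sketch: it
assumes the search hits from both indexes have already been parsed into
(path, last-modified) pairs with the date as a whole-second epoch value, and
that the real resource is a file on disk. The function names are
hypothetical.

```python
import os

def source_mtime(path):
    """Whole-second mtime of the live resource, or None if it is gone."""
    try:
        return int(os.stat(path).st_mtime)
    except FileNotFoundError:
        return None

def drop_stale(results, current_mtime=source_mtime):
    """Keep only hits whose indexed date still matches the live resource.

    results: iterable of (path, indexed_mtime) pairs from both indexes.
    """
    fresh = []
    seen = set()
    for path, indexed_mtime in results:
        if path in seen:
            continue  # already kept a matching hit for this path
        if current_mtime(path) == int(indexed_mtime):
            fresh.append((path, indexed_mtime))
            seen.add(path)
    return fresh
```

A modified record shows up twice (old version from the main index, new
version from the incremental one) and only the hit whose date matches
survives. A deleted record has no live date to match, so it is dropped too.
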
This system is a lot easier if you are indexing directly from a database,
since when indexing you can write the time you made the full index, and
then when making incremental updates you just ask the database for all
records with a date later than the date set when making the full index.
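
As a sketch of the database case, here is how it might look with SQLite.
The table and column names (records, modified, index_state) are invented
for the example; your schema will differ. The comparison uses >= rather
than a strict "later than" so a record changed in the same second is not
missed, at the cost of occasionally indexing a record in both indexes.

```python
import sqlite3

def open_demo_db():
    """In-memory database with a hypothetical schema for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records (id INTEGER PRIMARY KEY, body TEXT, modified INTEGER)"
    )
    conn.execute("CREATE TABLE index_state (full_index_started INTEGER)")
    return conn

def mark_full_index(conn, started_at):
    """Remember when the full index run started (epoch seconds)."""
    conn.execute("DELETE FROM index_state")
    conn.execute("INSERT INTO index_state VALUES (?)", (started_at,))

def records_for_incremental(conn):
    """Records modified at or after the start of the last full index."""
    row = conn.execute("SELECT full_index_started FROM index_state").fetchone()
    if row is None:
        raise RuntimeError("no full index has been recorded yet")
    return conn.execute(
        "SELECT id, body, modified FROM records WHERE modified >= ? ORDER BY id",
        (row[0],),
    ).fetchall()
```

Storing the time the full run started, not the time it finished, means
anything modified while the full index was being built gets picked up by the
next incremental run.
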
For files, there are a couple of ways. One is to use the -N switch and swish
will only index files newer than some other file. But I've also used a
system where new or modified files write a symlink to a parallel directory
structure, and use that to create the incremental index. The only
advantage there is that you don't have to stat() the entire live directory
tree to find only two or three new files, and when full indexing runs it
only needs to delete the symlinks.
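
The symlink idea could look roughly like this. It is a sketch with made-up
function and directory names, and it assumes whatever creates or modifies a
file is able to call mark_changed() afterwards.

```python
import os

def mark_changed(live_root, marker_root, rel_path):
    """Record that live_root/rel_path changed by symlinking to it."""
    target = os.path.abspath(os.path.join(live_root, rel_path))
    link = os.path.join(marker_root, rel_path)
    os.makedirs(os.path.dirname(link), exist_ok=True)
    if not os.path.islink(link):
        os.symlink(target, link)

def pending_files(marker_root):
    """Live files for the incremental index; dangling links are skipped."""
    found = []
    for dirpath, _dirs, names in os.walk(marker_root):
        for name in names:
            link = os.path.join(dirpath, name)
            if os.path.islink(link) and os.path.exists(link):
                found.append(os.path.realpath(link))
    return sorted(found)

def clear_markers(marker_root):
    """Called after a full index: remove the symlinks, keep the live files."""
    for dirpath, _dirs, names in os.walk(marker_root):
        for name in names:
            link = os.path.join(dirpath, name)
            if os.path.islink(link):
                os.unlink(link)
```

Only the small marker tree gets walked on each incremental run, never the
live tree.
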
There are race conditions in there, especially when indexing the file system,
so you have to think carefully about setting the date flags. As I said,
it's easier if the data is stored in a database.
As for when to do full indexing: I've had the incremental indexing measure
the time required to index and when it got over some limit I set a flag
which a nightly cron job uses to decide if a full index should be built.
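
In outline, that timing logic might be written like this. The flag file,
the limit, and the function names are all placeholders, and index_fn /
full_index_fn stand in for whatever actually runs the indexing.

```python
import os
import time

def run_incremental(index_fn, limit_seconds, flag_path, clock=time.monotonic):
    """Run index_fn, time it, and leave a flag file if it ran too long."""
    start = clock()
    index_fn()
    elapsed = clock() - start
    if elapsed > limit_seconds:
        # Touch the flag; the nightly job only checks that it exists.
        with open(flag_path, "a"):
            pass
    return elapsed

def nightly(flag_path, full_index_fn):
    """Nightly cron entry point: rebuild only when the flag is present."""
    if not os.path.exists(flag_path):
        return False
    full_index_fn()
    os.remove(flag_path)
    return True
```

The clock parameter is there only so the timing can be faked in a test.
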
A good example of how this might work is indexing a mail archive. Once a
week you index the entire archive. But during the week when a new message
comes in you index just the new messages since the last full index. Easy
since there's typically no deletions or changes, just additions.
>I'm not sure I have understood correctly: do I have to reindex
>everything? But SWISH-E uses a "Last modified" field. Does it mean that
>it is possible to reindex a portion of the whole db?
No, the last modified field is just another property that can be displayed.
There's nothing special or magical about that property. The current index
design does not allow for changes to an existing index file.
A lot of times people want incremental indexing and decide swish is not
usable because it doesn't currently have a way to do incremental indexing.
But unless your data is changing every few minutes or less, or you have a
huge amount of data, swish is so fast at indexing that sometimes it's just
as easy to reindex the entire data set.
Here's indexing 50,000 very small documents with -S prog and a perl program
to generate the data:
    50009 unique words indexed.
    4 properties sorted.
    50000 files indexed. 7227788 total bytes. 800000 total words.
    Elapsed time: 00:01:20 CPU time: 00:01:19
So, it's really an issue of how often your source data is changing, how
long the index can be out of sync with the source data, and how large the
data set is (that is, how long full indexing takes).
BTW -- don't forget to search the swish-e discussion list archive. This
topic has come up before.
Hope this helps,
Received on Sat Jan 26 14:30:46 2002