Jerry Stratton wrote:
> Two things about IgnoreLimit:
> 1. It appears that if I set it too low, I get a core dump, "segmentation
> fault". It always occurs on the same file, so it is either very regular,
> or some sort of interaction between files and IgnoreLimit.
> This only (so far) has occurred on creating large indexes: 4
> to 10 megabytes and appears to happen towards the last few files.
> If this is a known problem are the specifics of avoiding it known?
I haven't seen it, but you got me curious so I just had a look at the source
code to see if it was due to anything really obvious/easy to spot. (It
wasn't...). If you're able to get a core dump and then from that a traceback
showing which functions were active at the time, that could (if it's a
simple problem, at least) be sufficient to pin down the cause.
Identifying the common words and flagging them as stopwords is done after
scanning all the files (so that the final word and file counts are known), so
it's not surprising that the crash happens "near the end" of the run.
> 2. It may also be that I'm not understanding the significance of the two
> numbers in IgnoreLimit. It appears to me that they are two ways of
> representing the same number. Only the higher one takes effect. Is that
The comments preceding the removestops function in source file index.c say
/* Removes words that occur in over _plimit_ percent of the files and
** that occur in over _flimit_ files (marks them as stopwords, that is).
and it looks like that's pretty much what the code is doing. Actually, it
removes words occurring in *at least* the specified percentage and number of
files, rather than *over* those numbers. (>= rather than > in the tests)
The word does have to equal or exceed both numbers to be removed.
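To make the combined test concrete, here is a minimal sketch of the check as I
read it. It is not the actual removestops code: the function name
is_too_common and its parameters are made up for illustration, and the
integer arithmetic is my own choice rather than whatever index.c does.

```c
#include <stdbool.h>

/* Hypothetical helper, NOT the real swish-e code.
 * word_files  = number of files containing the word
 * total_files = number of files indexed
 * plimit      = first IgnoreLimit number (percentage of files)
 * flimit      = second IgnoreLimit number (count of files)
 * Returns true only when the word meets BOTH limits (>=, not >). */
bool is_too_common(int word_files, int total_files, int plimit, int flimit)
{
    if (total_files <= 0)
        return false;

    /* Integer form of (word_files / total_files) * 100 >= plimit. */
    bool meets_percent = (long)word_files * 100 >= (long)plimit * total_files;
    bool meets_count = word_files >= flimit;

    return meets_percent && meets_count;
}
```

With example limits of 80 and 50, a word in 50 of 60 files is flagged (83% and
50 files), a word in 49 of 60 files is not (81% but only 49 files), and a word
in 50 of 100 files is not (50 files but only 50%).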
The two numbers are alternatives, but equivalent only if you know the total
number of files and adjust the count-based limit on that basis... I'd view
it as a way of saying "words occurring in more than x files are too common -
unless I'm indexing so many files that that's not really very many overall;
if those x files are at least y% of the files being indexed, then it really
*is* too many." Or "as a generalisation, anything occurring in more than x%
of the files must be too common to be useful - except that if x% is fewer
than y files, we might as well include them all."
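Another way to see how the two numbers interact is to work out, for a given
total number of files, the smallest file count at which a word meets both
limits. The sketch below does that; min_files_to_ignore is a made-up name and
the rounding is an assumption on my part, not something taken from index.c.

```c
/* Hypothetical helper, NOT the real swish-e code.
 * Returns the smallest number of files a word must occur in before it
 * meets both the percentage limit (plimit) and the file-count limit
 * (flimit), for a run that indexes total_files files. */
int min_files_to_ignore(int total_files, int plimit, int flimit)
{
    /* ceil(plimit * total_files / 100), using integer arithmetic. */
    long by_percent = ((long)plimit * total_files + 99) / 100;

    /* Both limits must be met, so the larger of the two governs. */
    return by_percent > flimit ? (int)by_percent : flimit;
}
```

With limits of 80 and 50, a run over 40 files gives a threshold of 50, which is
more than the number of files, so nothing is ever flagged; over 60 files the
count limit still governs (50); over 1000 files the percentage takes over and
the threshold is 800.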
A limit based on number of files is fine if you can predict reliably a
sensible number, otherwise the percentage is more useful; in either case the
other can be adjusted as a "safety-valve" to avoid eliminating too many
words in atypical cases (if you're using the same config with different sets
of input files).
University of Cambridge WWW manager account (usually John Line)
Send general WWW-related enquiries to firstname.lastname@example.org
Received on Thu Feb 19 03:45:53 1998