Re: stemmer.c and swish-2.1.x

From: <jmruiz(at)not-real.boe.es>
Date: Tue Oct 31 2000 - 09:30:34 GMT

Hi Bill,

On 30 Oct 2000, at 12:05, Bill Moseley wrote:

> This is the kind of boring talk that results in an increase in the
> level of unsubscribe messages to Roy.  Sorry!
> 

Totally agree. Sorry.

> At 11:40 AM 10/30/00 -0800, jmruiz@boe.es wrote:
> >Well, it is a memory and portability issue. Why do you need to 
> >allocate MAXWORDLEN to stem a word? It is not necessary.
> >If you avoid any reference to MAXWORDLEN, you will get more
> >portable code that can be used outside swish-e.
> 
> I'll have to look at the code, I guess.
> 
> So in other words, you pass in a reference to the word, and a
> reference to its max length.  Then if Stem() needs more space than
> available it will malloc() a bigger chunk and return that (plus the
> new max length).  And I assume Stem() would also call free() on the
> original memory.
> 
> Am I getting it?
> 
Yes, you got it.

> If I am getting it, then one thing I'd be concerned with is if something
> in the calling code had a reference to the original word (its
> address, of course) before Stem() reallocated it.
> 
> Which is one reason why stemmed = Stem( word ) might be a better way to
> go. That is, allocate a temporary variable on the stack in Stem() to
> use to store the new stemmed word, and then malloc() a new variable
> right before returning and copy the temp variable to this new memory
> and return the pointer to the stemmed word.
> 
No problem, I can switch to this way:

stemmed = Stem(word,&length);

I did not explain myself clearly enough in my last post. Swish-e 
needs the length to be passed. Why?

Let us look at an example:

#define MAXSTRLEN 2000   /* As it is in swish.h */

wordlen=MAXSTRLEN;
word=malloc(MAXSTRLEN+1);

strcpy(word,"hello");

Stem(&word,&wordlen);  /* or */

word=Stem(word,&wordlen);

As you can see, wordlen is the total buffer size (MAXSTRLEN), not the 
string length (5). 

Swish-e 1.3.x uses big buffers of MAXSTRLEN to hold words, and
Swish-e 2.x inherits part of this behaviour. So if your buffer is big
enough, as in the example, Stem does not need to reallocate space, 
and in most cases word and the stemmed word will point to the same 
memory area.

Swish-e then uses wordlen as the buffer length, not as the string 
length. This value is very useful for avoiding buffer overruns and
for avoiding many calls to malloc and free.

Of course, wordlen can also be strlen(word). This is the case where
the length of the memory buffer matches the length of the string:

wordlen=strlen("hello");
word=emalloc(wordlen+1);

strcpy(word,"hello");

Stem(&word,&wordlen);  /* or */
word=Stem(word,&wordlen);

Thus, the returned value of wordlen is the size of the buffer that now 
holds the word. If the buffer was reallocated, this value will be the 
string length of the stemmed word. If not, the original value remains 
unchanged.

If the buffer is reallocated, the old buffer is freed.
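
To make the contract above concrete, here is a minimal sketch in C. It is
not the code in stemmer.c: replace_suffix() and its parameters are
hypothetical names chosen for illustration, and it uses plain malloc()
instead of emalloc(). It only shows the buffer handling: the buffer is
reused when the result fits, and otherwise a new buffer is allocated, the
old one is freed, and the length is updated to the new string length.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not the real stemmer.c code.
 * Replaces the last `cut` characters of `word` with `suffix`.
 * *buflen is the usable buffer size, excluding the terminating NUL,
 * as in the examples above (word = malloc(wordlen + 1)).
 * Returns the buffer holding the result. It is the same pointer when
 * the result fits; otherwise a new buffer is returned, the old one is
 * freed and *buflen is set to the new string length.
 * Returns NULL if allocation fails; `word` is then left untouched. */
char *replace_suffix(char *word, int *buflen, size_t cut, const char *suffix)
{
    size_t len, newlen;

    assert(word != NULL && buflen != NULL && suffix != NULL);
    assert(*buflen >= 0);
    len = strlen(word);
    assert(cut <= len);
    newlen = len - cut + strlen(suffix);

    if (newlen > (size_t)*buflen) {
        /* Result does not fit: move the unchanged root to a new buffer. */
        char *bigger = malloc(newlen + 1);
        if (bigger == NULL)
            return NULL;
        memcpy(bigger, word, len - cut);
        free(word);
        word = bigger;
        *buflen = (int)newlen;
    }
    /* Write the new suffix (and the terminating NUL) after the root. */
    strcpy(word + len - cut, suffix);
    return word;
}
```

With wordlen=MAXSTRLEN the pointer never changes for ordinary words. With
wordlen=strlen(word), a call such as word=replace_suffix(word,&wordlen,0,"e")
on "at" returns a new buffer holding "ate" and sets wordlen to 3. Note
that writing word=... loses the original pointer if NULL is returned, so
a careful caller would assign to a temporary first.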

Well, I think that this approach can make stemmer.c useful for
other programs. The code is more portable and reusable.

> Of course, as we discussed some time back, I tried this in my Stemmer
> perl module, but ended up with a memory leak as I'm not good enough at
> Perl's xs to know how to free memory......
> 

I can take a look at it.

> Or the other way is to simply force that a stemmed word is never
> longer than the original word.
> 

There are only three possibilities for increasing size:

static RuleList step1b1_rules[] =
           {
             {108,  "at",        "ate",   1,  2, -1,  NULL,},
             {109,  "bl",        "ble",   1,  2, -1,  NULL,},
             {110,  "iz",        "ize",   1,  2, -1,  NULL,},
..

But now increasing the size is no longer a problem. Well, at least 
I think so.
I have checked the new function with more than 8000 English words
extracted from /usr/doc on Linux, and it seems to work fine.
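
As a rough check on how much room those three rules can require, the
sketch below computes the largest growth from a rule table. The struct is
a simplified stand-in that keeps only the two suffix strings from the
entries quoted above; the names suffix_rule, growing_rules and
max_growth are hypothetical and are not part of stemmer.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for a rule entry: only the old and new suffix. */
struct suffix_rule {
    const char *old_end;
    const char *new_end;
};

/* The three suffix pairs quoted above. */
const struct suffix_rule growing_rules[] = {
    { "at", "ate" },
    { "bl", "ble" },
    { "iz", "ize" },
};

/* Largest number of characters any rule in the table can add. */
size_t max_growth(const struct suffix_rule *rules, size_t count)
{
    size_t i, worst = 0;

    assert(rules != NULL);
    for (i = 0; i < count; i++) {
        size_t oldlen = strlen(rules[i].old_end);
        size_t newlen = strlen(rules[i].new_end);
        if (newlen > oldlen && newlen - oldlen > worst)
            worst = newlen - oldlen;
    }
    return worst;
}
```

For the three pairs listed, each rule adds exactly one character, so a
single application of one of them needs at most one byte more than the
word it is applied to.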

I will change:
Stem(&word,&wordlen);  

to

word=Stem(word,&wordlen);

This is a minor change. And as I have said, the return code of 
Stem (TRUE or FALSE) is never used.

cu
Jose
Received on Tue Oct 31 09:36:33 2000