Re: Several aspects concerning configuration & search behaviour

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Feb 17 2005 - 00:07:39 GMT
On Wed, Feb 16, 2005 at 02:14:57PM -0800, Uwe Dierolf wrote:
> We have metaname that can contain all strings even stopwords.
> At the moment stopwords are used for all XML elements
> (== metanames).
> Would it be much work to exclude some specific metanames
> from stopword handling?

Not real easy.

<out>
    <in>
        foo bar baz
    </in>
</out>

So what you want to do is say:

StopWordsMeta in foo

Note that those words are inside both tags.

The problem is when "foo" is parsed, it's passed to the code that
looks for WordCharacters, stopwords and related options.  After that,
it's passed to the indexing code, which then adds the word for each
meta:

static void addword(char *word, SWISH *sw, int filenum, int structure,
                    int numMetaNames, int *metaID, int *word_position)
{
    int     i;

    /* Add the word for each nested metaname. */
    for (i = 0; i < numMetaNames; i++)
        (void) addentry(sw, getentry(sw,word), filenum, structure, metaID[i], *word_position);

    (*word_position)++;
}
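
To make the difficulty concrete: stopwords are dropped before addword()
ever sees the word, so by the time the loop above runs there is nothing
left to decide per metaname.  A per-metaname exclusion would probably
have to move the stopword test into that loop.  Here is a minimal,
self-contained sketch of that idea.  The struct, the toy stopword list
and the function names are all hypothetical stand-ins, not swish-e's
real types or API:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-metaname setting: nonzero means stopwords are kept. */
struct meta_opts {
    int meta_id;
    int keep_stopwords;
};

/* Toy stopword list standing in for the index's real one. */
static const char *const stopwords[] = { "the", "and", "of" };

static int is_stopword(const char *word)
{
    size_t i;

    for (i = 0; i < sizeof stopwords / sizeof stopwords[0]; i++)
        if (strcmp(word, stopwords[i]) == 0)
            return 1;
    return 0;
}

/*
 * Decide per nested metaname whether the word would be indexed.
 * Writes the accepted meta IDs to out[] (which must have room for
 * num_metas entries) and returns how many were written.
 */
static int metas_to_index(const char *word, const struct meta_opts *metas,
                          int num_metas, int *out)
{
    int i, n = 0;
    int stop;

    assert(word != NULL && metas != NULL && out != NULL);
    stop = is_stopword(word);

    for (i = 0; i < num_metas; i++)
        if (!stop || metas[i].keep_stopwords)
            out[n++] = metas[i].meta_id;
    return n;
}
```

In the real code the stopword check lives in the word-splitting step,
so the check would have to be deferred until the metaID list is known.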


> Another practical aspect concerns TranslateCharacters.
> For german umlauts it would be very good if we could
> map one character to two.
> How much work would it be to integrate a config directive like:
> TranslateCharacters  ae  oe  ue  ss
> TranslateCharacters xyz abc
> It's TranslateCharacters with two meanings
>  - tuple notation
>  - strings with equal length notation

Would need to code the config parser and a way to store and retrieve
it from the index header, and then a way to map the chars.  The
current system is just a 256-element array map.  Might need to make
two passes over the string: the first to determine how much memory to
reallocate for the result, the second to write it.
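
A rough sketch of what the mapping step could look like, assuming a
single-byte encoding such as Latin-1 and a table of 256 string pointers
in place of the 256-character map.  The type and function names are
made up for illustration and nothing here touches the config parser or
the index header:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical replacement for the 256-entry char map: each slot holds
 * either NULL (copy the byte unchanged) or a replacement string.
 */
typedef struct {
    const char *to[256];
} xlate_table;

/* Returns a newly malloc'd translated copy of in, or NULL on failure. */
static char *translate_string(const xlate_table *t, const char *in)
{
    const unsigned char *p;
    size_t len = 0;
    char *out, *w;

    assert(t != NULL && in != NULL);

    /* Pass 1: work out how long the result will be. */
    for (p = (const unsigned char *)in; *p; p++)
        len += t->to[*p] ? strlen(t->to[*p]) : 1;

    out = malloc(len + 1);
    if (out == NULL)
        return NULL;

    /* Pass 2: write the translated bytes. */
    w = out;
    for (p = (const unsigned char *)in; *p; p++) {
        if (t->to[*p]) {
            size_t n = strlen(t->to[*p]);
            memcpy(w, t->to[*p], n);
            w += n;
        } else {
            *w++ = (char)*p;
        }
    }
    *w = '\0';
    return out;
}
```

With slot 0xDF (sharp s in Latin-1) set to "ss", the input "Stra\xdf"
"e" would come back as "Strasse".  The caller frees the result.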


> Search aspects:
> ---------------
> The next point concerns IgnoreLastChar.
> We are excluding for example "-".
> But if we are searching for "foo-*" we want to find
> words containing "-". So it's wrong if swish-e handles
> IgnoreChars in cases of truncated searches.

Well, I suppose that if there's a wildcard then there's no reason to
remove IgnoreChars.

I argued long ago that you should be able to copy some text from the
original document (punctuation and all) and search for it and have
swish find that text.  So that means that the search parser must
process the input text exactly like the indexing code does.  And that
means splitting text on wordcharacters, removing IgnoreFirst/Last
chars, stopwords and so on.

Clearly, if swish takes "foo-bar" and indexes it as two words "foo bar" then
searching for "foo-bar" should really search for the two words.

IIRC, the "parser" tokenizes "foo*" into "foo *" and then after
processing "foo" into a "swish word" then adds the wild card back on.
So, the parser would need to see that it was working with a wild card
and skip the ignore characters.

So, you could likely modify next_swish_word() in swish_words.c to not
call stripIgnoreLastChars() if the next token is "*".  That might not
be too hard of a patch.
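
Roughly, the check might look like the sketch below.  This is a
standalone illustration, not the actual swish_words.c code: the
function name, its arguments and the way the next token is passed in
are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Strip trailing characters found in ignore_last from word, in place,
 * unless the token that follows is the "*" wildcard.
 * next_token may be NULL when word is the last token in the query.
 */
static void strip_last_unless_wildcard(char *word, const char *ignore_last,
                                       const char *next_token)
{
    size_t len;

    assert(word != NULL && ignore_last != NULL);

    if (next_token != NULL && strcmp(next_token, "*") == 0)
        return;             /* truncated search: leave "foo-" intact */

    len = strlen(word);
    while (len > 0 && strchr(ignore_last, word[len - 1]) != NULL)
        word[--len] = '\0';
}
```

Note this only helps if the indexed words can actually contain the
character; if "-" is not a word character at index time, "foo-*" still
has nothing to match.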

Do you know (or know anyone that knows - or anyone that could be tricked
into knowing) bison/yacc?  Take a little time in swish_words.c and you
will see what I mean.


> And last we found a little strange behaviour.
> If we have two index files (one large index and one
> for daily changes) and the small index which is
> searched after the large index does not contain all
> of the metanames of the large index swish-e does not
> return "0 hits" if there are no records found at all
> but instead returns "unknown metaname".
> We believe that this behaviour is not correct because 
> the missing metaname(s) in the small index are metanames 
> of the first index.

It's considered an error.  You are asking to search an index with a
given metaname that doesn't exist in that index.  As far as swish-e
knows you mistyped the metaname.

Just add the missing metanames to the index.

It's open source, so anything is possible.  So you could catch that
error, look up at the parent search object and check all the
associated open indexes for that metaname and ignore the error if you
like.
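
The "check all the open indexes" part could be sketched like this.
The struct is a hypothetical stand-in for whatever the search object
really holds, and the comparison is a plain strcmp(), which may not
match how swish-e compares metanames:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical view of one open index: just its list of metanames. */
struct open_index {
    const char *const *metanames;
    size_t num_metanames;
};

static int index_has_meta(const struct open_index *idx, const char *name)
{
    size_t i;

    for (i = 0; i < idx->num_metanames; i++)
        if (strcmp(idx->metanames[i], name) == 0)
            return 1;
    return 0;
}

/*
 * Returns 1 if at least one of the open indexes defines the metaname,
 * so a miss in a single index could be treated as "no hits" there
 * instead of an error.  Returns 0 if no index knows the name at all.
 */
static int any_index_has_meta(const struct open_index *indexes,
                              size_t num_indexes, const char *name)
{
    size_t i;

    assert(indexes != NULL && name != NULL);
    for (i = 0; i < num_indexes; i++)
        if (index_has_meta(&indexes[i], name))
            return 1;
    return 0;
}
```

A metaname that is missing from every index would still be reported as
an error, which keeps the typo check.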

-- 
Bill Moseley
moseley@hank.org

Received on Wed Feb 16 16:07:40 2005