Re: Question: indexing of html pages containing

From: Bill Moseley <moseley(at)>
Date: Fri Jun 14 2002 - 03:02:24 GMT
At 03:35 PM 06/13/02 -0700, Bruce Rodney wrote:
>The HTML files contain header info plus tables of floating point data, e.g.
>a coordinate such as 1532727.45. The tables contain 100's of rows of this
>rather tedious data which I wish to EXCLUDE from the index. My initial
>approach was to use BeginCharacters to only index words starting with [a-z].
>Problem is: there may be valid keywords in the header info above the table
>with are in fact integers, e.g. a unique object identifier such as
>102320000. So I really want to let integers be indexed, but not floats.

Well, not with the version you are running.  Not really even with the
2.1-dev-soon-to-be-released-when-we-get-time version.

If it's a unique identifier then why do you need to index it?  Use a dbm
table to build that index.

In 2.1-dev there's the IgnoreNumberChars config, but that won't work as-is:
if you set it to "0123456789." then integers would match too, and be
ignored along with the floats.  It does mean there is already code in swish
to do almost what you want, though.  You could just add a bit of code to
check if there's a "." and, if not, allow indexing.

Look in 2.1-dev's src/index.c for:

     /* Weed out Numbers - or anything that's all the listed chars */

You would have to include the "." in WordCharacters, too, which you
probably don't want to do (although you could use IgnoreLastChar to deal
with periods at the end of sentences).

I don't think IgnoreNumberChars is really that useful.  It would not be
much work to add a regular expression check in that code.  That would
probably be more useful.
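
To make the idea concrete, here is a minimal sketch of that kind of check,
written in Python rather than as a patch to swish.  The pattern and the
should_index() name are assumptions for illustration only; nothing here is
taken from src/index.c.

```python
import re

# Hypothetical pattern (not swish-e code): optional sign, optional leading
# digits, a decimal point, then at least one digit.  A trailing period after
# an integer ("102320000.") deliberately does not count as a float.
FLOAT_RE = re.compile(r"[-+]?\d*\.\d+")


def should_index(word: str) -> bool:
    """Return False for float-like words and True for everything else."""
    return FLOAT_RE.fullmatch(word) is None
```

With this, an identifier like 102320000 is kept and a coordinate like
1532727.45 is dropped.  Ordinary words pass through untouched.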

So the short answer is that there's no easy way to index integers but not
floats.

Are all the floats within a given tag?  If that's the case then you can
pre-parse the docs.  With 2.1-dev swish-e can run an external program that
fetches your documents.  That program could remove the floats before
passing the document off to swish for indexing.  It's a lot easier than it
probably sounds. ...
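
As a rough sketch of the filtering step only, something like the following
would do, assuming a "float" is digits, a decimal point, and more digits
standing on their own.  The strip_floats() name and the pattern are made up
for illustration.  How the filtered document is handed to swish is not
shown here; check the 2.1-dev docs for that.

```python
import re

# Assumed definition of a float token: optional sign, digits, ".", digits,
# not glued to a neighbouring word character or another ".".  This is
# deliberately conservative, so something like "1.2.3" is left alone.
FLOAT_TOKEN = re.compile(r"(?<![\w.])[-+]?\d+\.\d+(?![\w.])")


def strip_floats(text: str) -> str:
    """Return text with stand-alone float tokens removed; integers stay."""
    return FLOAT_TOKEN.sub("", text)
```

A wrapper script could then read each document, run it through
strip_floats(), and write out the result, so the table cells end up empty
while integer identifiers in the header survive.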

Bill Moseley
Received on Fri Jun 14 03:05:57 2002