Re: ignoring words inside form elements

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Apr 12 2001 - 01:00:12 GMT
At 05:13 PM 04/11/01 -0700, Myke Komarnitsky wrote:
>I got a quick response from that last one, thank you!  Another query:  we 
>use a pulldown for navigation, so the html looks like:
>
><SELECT NAME="routes" onChange="switchRoute(this.form)">
><OPTION VALUE="">Choose a Route</OPTION>
><OPTION  VALUE="crimp_master">Crimp Master - *</OPTION>
><OPTION  VALUE="dyno_one">Dyno One - *</OPTION>
></SELECT>
>
>Unfortunately, queries that match the pulldowns (say "Crimp Master") pull 
>up every page with that in the nav tool.... is there a way to ignore text 
>inside a form element such as this?

It depends.

There's a new directive in the development version called IgnoreMetaTags,
but currently it's only available when indexing XML files.  Maybe that
could be extended to HTML, but then there are more complicated parsing issues.

You could index your files as XML to make use of this feature, depending on
what you need (e.g. context searching).  But, alas, it seems like tag
attributes in the option tag trick the current code.

IndexContents XML .html
IgnoreMetaTags select

But currently that only works if the tag is a bare <SELECT>, not
<SELECT NAME="routes" ...>.  I think that's a bug.

Of course there is a way to do this with the new "prog" document source
feature ;)

You can use Perl's HTML::Parser (or maybe HTML::TreeBuilder) and trim out
the <OPTION> tags and content.  You would have full control over what gets
indexed by that method.  You could output plain text if that's all you need
indexed (and you will be able to tell us what is faster: HTML::Parser or
Swish's internal parser), or you could output XML if you want to use
metatags, or HTML if you want metatags and/or context searching and ranking
by context (text inside <em> ranks higher than plain text).
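
As a rough illustration of the filtering idea, here is a minimal sketch. It uses Python's standard html.parser as a stand-in for Perl's HTML::Parser, and the names SelectStripper and strip_select_text are hypothetical, made up for this example. It only shows the filtering step: it returns plain text with everything inside <SELECT>...</SELECT> dropped, and leaves out whatever wrapping the "prog" document source expects around each document.

```python
from html.parser import HTMLParser


class SelectStripper(HTMLParser):
    """Collect page text, skipping anything inside <select>...</select>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0      # > 0 while inside a <select> element
        self.chunks = []    # text fragments that should be indexed

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attributes such as NAME= are ignored,
        # so <SELECT NAME="routes" ...> is matched like a bare <SELECT>.
        if tag == "select":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "select" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every <select> element.
        if self.depth == 0:
            self.chunks.append(data)


def strip_select_text(html):
    """Return the plain text of `html` without any pulldown-menu text."""
    parser = SelectStripper()
    parser.feed(html)
    parser.close()
    # Join fragments with spaces and collapse runs of whitespace.
    return " ".join(" ".join(parser.chunks).split())
```

Run over a page containing the navigation pulldown above, this should keep the body text and drop "Choose a Route", "Crimp Master - *" and so on.  Joining the fragments with spaces is a simplification: it keeps words in adjacent elements apart, at the cost of splitting a word that has inline markup in the middle of it.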

Maybe someone else has a better suggestion.





Bill Moseley
mailto:moseley@hank.org
Received on Thu Apr 12 01:03:46 2001