On Thu, Nov 13, 2003 at 07:06:09PM -0800, Dave Moreau wrote:
> I think for XML attributes, it is more helpful to separate from surrounding
> information regardless of spacing based on how XML attributes are normally
> used.
Well, maybe someone else with XML experience can help out. My laptop's
battery is about out so I can't do much more searching tonight. I can
find discussions of "ignorable whitespace" in XML, and swish-e does
have code for dealing with that (it just adds a space to the input
buffer), but I can't find what the correct behavior is when there is
no whitespace between tags, as in:
<foo>fooword</foo><bar>barword</bar>
Unlike my example with bold (inline tags), I can't imagine a situation
in XML where that should be "foowordbarword". So here's a quick
patch.
Index: parser.c
===================================================================
RCS file: /cvsroot/swishe/swish-e/src/parser.c,v
retrieving revision 1.48
diff -u -r1.48 parser.c
--- parser.c 4 Sep 2003 04:02:40 -0000 1.48
+++ parser.c 14 Nov 2003 05:52:17 -0000
@@ -993,8 +993,13 @@
if ( sw->UndefinedMetaTags == UNDEF_META_ERROR )
progerr("Found meta name '%s' in file '%s', not listed as a MetaNames in config", tag, parse_data->fprop->real_path);
- else if ( DEBUG_MASK & DEBUG_PARSED_TAGS )
- debug_show_tag( tag, parse_data, 1, "(undefined meta name - no action)" );
+ else {
+ /* In general a single word doesn't span tags */
+ append_buffer( &parse_data->text_buffer, " ", 1 );
+
+ if ( DEBUG_MASK & DEBUG_PARSED_TAGS )
+ debug_show_tag( tag, parse_data, 1, "(undefined meta name - no action)" );
+ }
}
@@ -1048,6 +1053,11 @@
/* Don't allow matching across tag boundry */
if (!is_html_tag && !isDontBumpMetaName(parse_data->sw->dontbumpendtagslist, tag))
parse_data->word_pos++;
+
+ /* Tags normally separate words */
+ if (!is_html_tag)
+ append_buffer( &parse_data->text_buffer, " ", 1 );
+
I wonder if gcc will optimize those two checks on is_html_tag...
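To make the intent of the patch concrete, here is a minimal standalone sketch of the idea, not swish-e code. The text_buf struct, buf_append() and strip_tags() are hypothetical names made up for the illustration; the real parser uses append_buffer() on parse_data->text_buffer as shown in the diff. The sketch just copies character data out of a flat XML string and, optionally, emits a space at every tag.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for swish-e's text buffer; not the real struct. */
typedef struct {
    char   data[256];
    size_t len;
} text_buf;

/* Append n bytes, silently truncating if the fixed buffer is full. */
static void buf_append(text_buf *b, const char *s, size_t n)
{
    size_t room;

    assert(b != NULL && s != NULL);
    room = sizeof b->data - 1 - b->len;
    if (n > room)
        n = room;
    memcpy(b->data + b->len, s, n);
    b->len += n;
    b->data[b->len] = '\0';
}

/* Copy text out of a flat XML string, dropping the tags themselves.
 * When sep_at_tags is nonzero, emit one space for each tag, which is
 * the effect the patch aims for on non-HTML tags. */
static void strip_tags(const char *xml, int sep_at_tags, text_buf *out)
{
    int in_tag = 0;
    const char *p;

    out->len = 0;
    out->data[0] = '\0';
    for (p = xml; *p; p++) {
        if (*p == '<') {
            in_tag = 1;
        } else if (*p == '>') {
            in_tag = 0;
            if (sep_at_tags)
                buf_append(out, " ", 1);
        } else if (!in_tag) {
            buf_append(out, p, 1);
        }
    }
}
```

With sep_at_tags set to 0 the two words from the example above run together in the buffer; with it set to 1 they end up separated by whitespace, so a later word splitter would see two words.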
Still, there's another problem that this patch does not address:
moseley@bumby:~/swish-e-2.4.0.patches/src$ cd
moseley@bumby:~$ cat 1.xml
<xml>
start<foo>word</foo><bar>another</bar>end
</xml>
moseley@bumby:~$ cat c
UndefinedMetaTags index
UndefinedXMLAttributes index
DefaultContents xml2
moseley@bumby:~$ swish-e-2.4.0.patches/src/swish-e -i 1.xml -v0 -T indexed_words
Adding:[1:swishdefault(1)] 'start' Pos:8 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'word' Pos:9 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'another' Pos:10 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'end' Pos:11 Stuct:0x9 ( BODY FILE )
Notice those position numbers? Since they are in sequence, phrase
matching would "work". That is, you can search for the
phrase:
"start word another end"
Which is probably not what we want.
When swish-e ignores tags (i.e. they are not MetaNames), it basically
treats the text as if the tags weren't there. It might be better to bump
the position number at the same time as adding a space to separate the
words at tags.
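As a rough illustration of that last idea, here is a standalone sketch, not swish-e code. The function word_positions() and the tag_bump parameter are hypothetical names invented for the example, and numbering starts at 1 rather than at the Pos:8 seen in the output above. It numbers the words in a flat XML string and adds tag_bump to the running position at every tag.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define MAX_WORDS 16

/* Assign a position to each word in a flat XML string. Every tag adds
 * tag_bump to the running position, so with tag_bump > 0 words on
 * opposite sides of a tag are never numbered consecutively.
 * Returns the number of words found; positions go into pos[]. */
static size_t word_positions(const char *xml, int tag_bump, int pos[MAX_WORDS])
{
    size_t count = 0;
    int next_pos = 1;            /* position the next word will receive */
    int in_tag = 0, in_word = 0;
    const char *p;

    assert(xml != NULL && tag_bump >= 0);
    for (p = xml; *p; p++) {
        if (*p == '<') {
            in_tag = 1;
            in_word = 0;         /* a tag always ends the current word */
        } else if (*p == '>') {
            in_tag = 0;
            next_pos += tag_bump;
        } else if (in_tag) {
            continue;            /* skip the tag name itself */
        } else if (isalnum((unsigned char)*p)) {
            if (!in_word && count < MAX_WORDS)
                pos[count++] = next_pos++;
            in_word = 1;
        } else {
            in_word = 0;
        }
    }
    return count;
}
```

Run on "start<foo>word</foo><bar>another</bar>end", a tag_bump of 0 numbers the four words consecutively, which is the situation shown in the indexed_words output. A tag_bump of 1 leaves a gap at every tag boundary, so a phrase match that requires adjacent positions would no longer span the tags.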
Oh, only 2%. Time to halt.
--
Bill Moseley
moseley@hank.org
Received on Fri Nov 14 06:19:01 2003