Re: attribute value attaching to words

From: <moseley(at)not-real.hank.org>
Date: Fri Nov 14 2003 - 06:18:54 GMT
On Thu, Nov 13, 2003 at 07:06:09PM -0800, Dave Moreau wrote:

> I think for XML attributes, it is more helpful to separate from surrounding 
> information regardless of spacing based on how XML attributes are normally 
> used.

Well, maybe someone else with XML experience can help out.  My laptop's 
battery is about out so I can't do much more searching tonight.  I can 
find discussions of the "ignorable whitespace" in XML, and swish-e does 
have code for dealing with that (it just adds a space into the input 
buffer), but I can't find what the correct behavior is when there is 
no whitespace between tags, as in:

   <foo>fooword</foo><bar>barword</bar>

Unlike my example with bold (inline tags), I can't imagine a situation 
in XML where that should be indexed as "foowordbarword".  So here's a 
quick patch.

Index: parser.c
===================================================================
RCS file: /cvsroot/swishe/swish-e/src/parser.c,v
retrieving revision 1.48
diff -u -r1.48 parser.c
--- parser.c    4 Sep 2003 04:02:40 -0000       1.48
+++ parser.c    14 Nov 2003 05:52:17 -0000
@@ -993,8 +993,13 @@
         if ( sw->UndefinedMetaTags == UNDEF_META_ERROR )
                 progerr("Found meta name '%s' in file '%s', not listed as a MetaNames in config", tag, parse_data->fprop->real_path);
 
-        else if ( DEBUG_MASK & DEBUG_PARSED_TAGS )
-            debug_show_tag( tag, parse_data, 1, "(undefined meta name - no action)" );
+        else {
+            /* In general a single word doesn't span tags */
+            append_buffer( &parse_data->text_buffer, " ", 1 );
+
+            if ( DEBUG_MASK & DEBUG_PARSED_TAGS )
+                debug_show_tag( tag, parse_data, 1, "(undefined meta name - no action)" );
+        }
     }
             
 
@@ -1048,6 +1053,11 @@
     /* Don't allow matching across tag boundry */
     if (!is_html_tag && !isDontBumpMetaName(parse_data->sw->dontbumpendtagslist, tag))
         parse_data->word_pos++;
+
+    /* Tags normally separate words */
+    if (!is_html_tag)
+        append_buffer( &parse_data->text_buffer, " ", 1 );
+

I wonder if gcc will optimize those two checks on is_html_tag...
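
If relying on the compiler feels uncertain, the two checks could also be folded into one block by hand. The following is only a rough sketch of that shape, not swish-e code: ToyParseData, toy_append() and toy_end_tag() are hypothetical stand-ins, and the dont_bump argument stands in for the result of isDontBumpMetaName().

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for swish-e's parse state. */
typedef struct {
    char   text[256];   /* toy text buffer */
    size_t len;         /* bytes used in text */
    int    word_pos;    /* current word position */
} ToyParseData;

/* Hypothetical stand-in for append_buffer(): append n bytes if they fit. */
static void toy_append(ToyParseData *pd, const char *s, size_t n)
{
    if (pd->len + n < sizeof pd->text) {
        memcpy(pd->text + pd->len, s, n);
        pd->len += n;
        pd->text[pd->len] = '\0';
    }
}

/*
 * End-tag handling with is_html_tag tested once.
 * dont_bump is an assumption: it stands in for the real
 * isDontBumpMetaName() lookup, which is not reproduced here.
 */
void toy_end_tag(ToyParseData *pd, int is_html_tag, int dont_bump)
{
    if (!is_html_tag) {
        /* Don't allow matching across a tag boundary */
        if (!dont_bump)
            pd->word_pos++;

        /* Tags normally separate words */
        toy_append(pd, " ", 1);
    }
}
```

In the nested form the lookup still only matters for non-HTML tags, which matches the short-circuit in the original condition.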


Still, there's another problem that this patch does not address: 

moseley@bumby:~/swish-e-2.4.0.patches/src$ cd

moseley@bumby:~$ cat 1.xml
<xml>
start<foo>word</foo><bar>another</bar>end
</xml>

moseley@bumby:~$ cat c
UndefinedMetaTags index
UndefinedXMLAttributes  index
DefaultContents xml2

moseley@bumby:~$ swish-e-2.4.0.patches/src/swish-e  -i 1.xml -v0 -T indexed_words
    Adding:[1:swishdefault(1)]   'start'   Pos:8  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'word'   Pos:9  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'another'   Pos:10  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'end'   Pos:11  Stuct:0x9 ( BODY FILE )

Notice those position numbers?  Since they are in sequence, that means 
that phrase matching would "work".  That is, you can search for the 
phrase:

   "start word another end"

Which is probably not what we want.

When swish-e ignores tags (i.e. they are not MetaNames) it's basically 
like they just don't exist in the text.  It might be better to bump that 
position number at the same time as adding a space in to separate the 
words at tags.
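
To make that idea concrete, here is a toy sketch, not swish-e code, of an indexer that treats every tag as a word separator and can optionally bump the position at each tag as well. The names Word, toy_index() and toy_adjacent() are made up for illustration, and bumping once per tag (start or end) is an assumption rather than what swish-e actually does.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_WORD_LEN 63

typedef struct {
    char text[MAX_WORD_LEN + 1];
    int  pos;                     /* word position, as in the Pos: output */
} Word;

/*
 * Hypothetical toy indexer: strips <...> tags, splits words on
 * whitespace and on tags, and assigns increasing positions.
 * If bump_at_tags is non-zero, each tag also skips one position.
 * Returns the number of words stored in out (at most max_words).
 */
size_t toy_index(const char *in, int bump_at_tags, Word *out, size_t max_words)
{
    size_t count = 0;   /* words stored so far */
    size_t len = 0;     /* length of the word being built */
    int pos = 1;        /* position the next word will get */
    int in_tag = 0;

    for (;; in++) {
        unsigned char c = (unsigned char)*in;
        int at_tag_start = (!in_tag && c == '<');
        int is_word_char = (!in_tag && c != '\0' && !at_tag_start && !isspace(c));

        if (is_word_char) {
            if (len < MAX_WORD_LEN && count < max_words)
                out[count].text[len++] = (char)c;
        } else if (len > 0) {
            /* whitespace, a tag, or end of input finishes the word */
            out[count].text[len] = '\0';
            out[count].pos = pos++;
            count++;
            len = 0;
        }

        if (c == '\0')
            break;
        if (at_tag_start) {
            in_tag = 1;
            if (bump_at_tags)
                pos++;            /* leave a gap so phrases can't span the tag */
        } else if (in_tag && c == '>') {
            in_tag = 0;
        }
    }
    return count;
}

/* Returns 1 if `second` is indexed at exactly the position after `first`. */
int toy_adjacent(const Word *w, size_t n, const char *first, const char *second)
{
    for (size_t i = 0; i + 1 < n; i++)
        if (strcmp(w[i].text, first) == 0 &&
            strcmp(w[i + 1].text, second) == 0 &&
            w[i + 1].pos == w[i].pos + 1)
            return 1;
    return 0;
}
```

Fed the contents of 1.xml with bump_at_tags set to 0, the four words get consecutive positions, so "word another" counts as adjacent. With bump_at_tags set to 1 the positions have gaps at every tag and the same pair no longer matches as a phrase.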

Oh, only 2%.  Time to halt.



-- 
Bill Moseley
moseley@hank.org
Received on Fri Nov 14 06:19:01 2003