Re: Problem with indexing xml with "prog" option

From: Bill Moseley <moseley(at)>
Date: Thu Dec 27 2001 - 14:25:02 GMT
At 03:37 AM 12/27/01 -0800, Cristiano Corsani wrote:
>I'm testing it with ms-access on w2000, but
>the final version will run on linux mysql.

Good, it will be helpful to see how it works on w2k.  There's an issue
with line endings on Windows.  If you look at prog-bin/ you will
see that the script puts its output in binmode (binmode STDOUT).  Make sure
your script does the same, as swish is expecting a single-char \n line
terminator.  The two-char Windows terminator (\r\n) will cause problems
otherwise.
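
To see why the terminator matters, here is a minimal sketch that only simulates text-mode output with a substitution (it does not touch any real filehandle).  The record and variable names are made up for illustration.  Every \n turns into two bytes, so what actually gets written is longer than what the script intended to send:

```perl
use strict;
use warnings;

# A made-up record: two headers, a blank line, then three bytes of content.
my $record = "Path-Name: a.xml\nContent-Length: 3\n\nabc";

# Simulate what text-mode output on Windows does to each "\n".
( my $as_written = $record ) =~ s/\n/\r\n/g;

printf "intended: %d bytes, written: %d bytes\n",
    length($record), length($as_written);
```

With binmode STDOUT the translation is not applied and the two counts stay equal.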

>Path-Name: ANA0003056.xml
>Content-Length: 128
>Last-Mtime: 41194
><bid>ANA0003056</bid><author></author><title>arazzi rubensiani e tessuti
>preziosi dei musei diocesani di ancona e osimo</title>
>Path-Name: ANA0002686.xml
>Content-Length: 111
>Last-Mtime: 41194
><bid>ANA0002686</bid><author>anselmi sergio</author><title>immagini delle
>marche negli archivi alinari</title>

You need a blank line after the headers.  Think of HTTP.  You also MUST
make sure that your content length is correct.

For example, in perl you might do this:

    binmode STDOUT;
    while ( my $rec = get_record() ) {

        my $content   = $rec->content;
        my $length    = length $content;
        my $mtime     = $rec->modified_timestamp;
        my $file_name = $rec->rec_id;

        # Headers, then one blank line; the heredoc ends before the content.
        print <<EOF;
    Content-Length: $length
    Last-Mtime: $mtime
    Path-Name: $file_name

    EOF
        print $content;
    }

Note that there's a blank line after Path-Name: header to separate the
headers from the content.  And also note that "print $content;" is not
within the "<<EOF" document.  The following won't work:

        print <<EOF;
    Content-Length: $length
    Last-Mtime: $mtime
    Path-Name: $file_name

    $content
    EOF

That won't work because now there's an extra new line after $content.
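
Putting those pieces together, a self-contained sketch might look like the following.  format_record() and the sample data are hypothetical stand-ins for your own database code, the mtime is just a sample value, and the sketch assumes the content is already a byte string so that length() matches what is actually written:

```perl
use strict;
use warnings;

# Hypothetical helper: build one record for swish-e's "prog" input.
# Assumes $content is already a byte string, so length() is the byte count.
sub format_record {
    my ( $path, $mtime, $content ) = @_;

    my $headers = join "\n",
        "Path-Name: $path",
        "Content-Length: " . length($content),
        "Last-Mtime: $mtime";

    # A blank line ends the headers; nothing is appended after the content.
    return "$headers\n\n$content";
}

# Sample data standing in for rows fetched from a database.
my @records = (
    [ 'ANA0002686.xml', 1009463102, '<bid>ANA0002686</bid><title>immagini</title>' ],
);

binmode STDOUT;    # keep "\n" as a single byte on Windows
print format_record(@$_) for @records;
```

Building the record as one string keeps the content out of any heredoc, so there is no way for a stray newline to sneak in after it.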

Notes on your config:

>ParserWarnLevel 3
That's only for the XML2 and HTML2 parsers.

>EnableAltSearchSyntax yes
That's not implemented, although I'm not sure why.

One note about using XML compared to HTML.  In HTML the <title> is special.
The <title> is indexed as normal body text, but the words are flagged as
title words, which rank much higher than normal body text.  So when
searching, hits on title words rank higher.  Also, in HTML, the <title> tag
is stored as the *property* "swishtitle".

None of that happens in XML.  So, if you want your <title> words to rank
higher, then I'd recommend using HTML to format your data.  

If you are indexing html and xml at the same time and want all your titles
collected together you can use the tag <swishtitle> in your xml docs to
make the "title" in your xml get stored under the same property name as is
used to store the <title> when indexing HTML.  That way search results will
print the title regardless of xml or html.
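
If your records already carry a <title> element, one way to get there is to rename it before printing the record.  This is only a sketch: use_swishtitle() is a hypothetical helper, and the plain substitution assumes flat records like the ones you posted (no attributes, CDATA or namespaces):

```perl
use strict;
use warnings;

# Hypothetical helper: rename <title> elements to <swishtitle> in an XML record.
sub use_swishtitle {
    my ($xml) = @_;
    $xml =~ s{<(/?)title>}{<${1}swishtitle>}g;
    return $xml;
}

my $rec = '<bid>ANA0002686</bid><author>anselmi sergio</author>'
        . '<title>immagini delle marche negli archivi alinari</title>';

print use_swishtitle($rec), "\n";
```

Do the rename before you compute Content-Length, since the record gets longer.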

[moving way off topic now...]

Speaking of <title>...the ranking is a bit broken:  Currently, if you have
"foo" in a HTML <title> tag, then every "foo" in the <body> is also flagged
as being a title word.  The effect is that a hit on "foo" will produce a
really high rank for that document, since all the "foo"s are flagged as
title words.

It's really noticeable when searching for a phrase:

> cat 1
<title>foo</title> foo bar

> cat 2
<title>baz</title> foo bar

You would expect doc "1" to rank higher here because "foo" is in the title.

> ./swish-e -w foo -H 0
1000 1 "foo" 28
187 2 "baz" 28

You would expect both docs to rank the same here since both have bar in
their body.

> ./swish-e -w bar -H 0
1000 2 "baz" 28
1000 1 "foo" 28

But the phrase search demonstrates the problem:

> ./swish-e -w '"foo bar"' -H0
1000 1 "foo" 28
315 2 "baz" 28

Those should really rank the same.  The "foo" that was hit was a
body-only "foo" in both files.

The plan is to fix that, but that will increase the index size and memory
requirements since extra data must be stored for every word indexed.  Might
be a good candidate for a #define in config.h.

Oh, that's more than you were asking for.  Sorry!

Bill Moseley
Received on Thu Dec 27 14:25:13 2001