Re: (Re)Definition of swishdefault

From: Guido Adam <guido.adam(at)not-real.gmx.de>
Date: Wed Aug 28 2002 - 16:07:50 GMT

Thanks for the very detailed answer :-)
Seems like my question(s) was/were not as clear as I thought...

I wrote a spider that abstracts plain-text pages, HTML pages, etc. and 
produces the XML data I described.

> >       MetaNames document url size type date crawldate keywords \
> >               description link title content

All these tags are metanames _and_ properties. They are used as fields. The 
idea is to be able to search for words in the document content as well as 
for documents of a certain size or content type. All fields should also be 
displayable if wanted.

Indexing the data is easy, but when I got to the search interface, I found 
it annoying to use

> >       swish-e -f index_test -w "content=(harry AND potter)"

because <content> is where all the words are.
The search syntax and my search script get simpler if I can use

> >       swish-e -f index_test -w "harry AND potter"

for a simple search (fewer parentheses). Extended searches (like a field 
search) that use the other fields are expected to be more complicated.
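
A minimal sketch of how such a search script could build its query string. 
The helper name build_query is hypothetical, and the sketch assumes that 
<content> has been aliased to swishdefault so that a query without a field 
searches the document text:

```shell
# build_query FIELD WORDS
# Hypothetical helper: prints a swish-e query string.
# An empty FIELD means "search the default metaname" (swishdefault).
build_query() {
    field="$1"
    words="$2"
    if [ -z "$field" ]; then
        # Simple search: no metaname, so no parentheses are needed.
        printf '%s\n' "$words"
    else
        # Field search: restrict the words to one metaname.
        printf '%s=(%s)\n' "$field" "$words"
    fi
}

build_query "" "harry AND potter"        # prints: harry AND potter
build_query "title" "harry AND potter"   # prints: title=(harry AND potter)

# The result would then be handed to swish-e, for example:
# swish-e -f index_test -w "$(build_query "" "harry AND potter")"
```

Only the extended (field) searches carry the extra syntax; the simple case 
passes the words through unchanged.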

As you wrote

>   MetaNameAlias swishdefault content

solves my problem.
I didn't understand the documentation on that point, it seems.
So the meaning of

         MetaNameAlias swishdefault content

is something like

         swishdefault = swishdefault + content
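
Put together, the relevant part of the configuration might look like the 
sketch below. It writes the two directives to a scratch file; the file name 
and the sample document name are hypothetical, and keeping the other tags in 
MetaNames is an assumption based on the list quoted above:

```shell
# Hypothetical scratch location for the example configuration.
conf_file="${TMPDIR:-/tmp}/swish.conf.example"

# "content" is dropped from MetaNames and aliased to swishdefault instead.
cat > "$conf_file" <<'EOF'
MetaNames document url size type date crawldate keywords \
        description link title
MetaNameAlias swishdefault content
EOF

cat "$conf_file"

# To check the effect, index a single document with -T indexed_words
# (commented out so the snippet runs without swish-e installed):
# swish-e -c "$conf_file" -i one_document.xml -f index_test -T indexed_words
```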

Greetings

Guido


At 28.08.2002 08:05 -0700, you wrote:
>Sorry, I'm rotten at giving short answers...
>
>At 05:47 AM 08/28/02 -0700, Guido Adam wrote:
> >All tags are defined as meta-tags in the swish.conf:
> >
> >       MetaNames document url size type date crawldate keywords \
> >               description link title content
> >
> >Problem:
> >If I search, I have to do something like
> >
> >       swish-e -f index_test -w "content=harry"
> >
> >I'd like to do
> >
> >       swish-e -f index_test -w "harry"
>
>Note that that is the same thing as
>
>        swish-e -f index_test -w swishdefault=harry
>
>That means your front-end code can be more generic.  Since *everything* is
>a metaname you can always specify a metaname and that will make it easier
>to program:
>
>         $swish_query = "$metaname=($query_words)";
>
> >Is it possible to "define" <content> as swishdefault, <title> as
> >swishtitle, <url> as swishdocpath and <description> as swishdescription? If
> >so, how to do that?
>
>I think you are mixing some concepts here.  Or at least you are asking two
>questions.
>
>Swish has properties and metanames.  Metanames are used for searching,
>whereas properties store data associated with each file.  It's
>kind of backwards as properties are really metadata.
>
>So, you can alias the meta names while indexing:
>
>   http://swish-e.org/2.2/docs/SWISH-CONFIG.html#item_MetaNameAlias
>
>Remove "content" from MetaNames and instead add it as:
>
>   MetaNameAlias swishdefault content
>
>Then searching  ./swish -w foo will find "foo" even if it was in the tag
><content>.  Use the -T indexed_words option to index a single document and
>you can see how it works.
>
>Now, the other tags you list above sound more like properties.  So then you
>would use PropertyNameAlias instead.
>
>So I think those are your answers.
>
>If you are not *mixing* indexing of HTML and XML docs, then there's no need
>to map (alias) your tag names onto the default propertynames that swish
>uses.  Just use your names and use -x to get out the data you want.
>
>That's how swish works internally.  It just uses a default -x setting of:
>
>     "%r %p \"%t\" %l"
>
>which is in long form:
>
>     -x '<swishrank> <swishdocpath> "<swishtitle>" <swishdocsize>\n'
>
>
>Now, the "title" is a special case, and I'm not really sure what you want
>to do.  I try to explain below.
>
> >The index contains xml data only.
>
>Just to be clear, HTML and XML parsing are basically the same.  There are
>three differences.
>
>1) HTML tags are not added when using "UndefinedMetaTags auto".
>"UndefinedMetaTags auto" might be useful when you are indexing XML and want
>every tag to be automatically created as a Metaname.  (My guess is this is
>not that useful of a configuration setting.)
>
>2) HTML tags set flags on the word indicating *where* in the HTML doc a
>word is found, such as in the <head>, <title>, <body>, <strong|b|em|i>,
><h*>.  These flags do two things.  First, they can be used with the -t
>switch to limit searches to words in those sections of a document (anyone
>use that feature?)  Second, the flags are used in ranking to rank some
>words higher than others, most commonly title words are ranked higher than
>body words.
>
>(BTW -- that flag is called the word's "structure")
>
>3) Text in HTML <title> tags is indexed as swishdefault, so you
>automatically search the title in addition to the body of the document.
>
>The MetaNameAlias thing happens after processing HTML tags.  So, although
>you can do:
>
>   MetaNameAlias swishdefault title
>
>and get your <title> indexed as swishdefault, it will *not* have the flags
>to indicate that it is a title word and rank higher in search results.
>
>One plan is to be able to set a ranking bias by metaname so that you could
>say, rank words in <keywords>...</keywords> higher.  But that doesn't solve
>the problem of indexing <title> as swishdefault, and also making those
>words rank higher.  Plus, that won't work for aliases since alias mapping
>happens at indexing and rank calculation is done at search time.
>
>You can't make the parser assign the flags by simply indexing your .xml
>files as type HTML2 because the tag mapping doesn't happen in the parser --
>that is, the parser doesn't rename the tags before swish sees them (the
>mapping happens when swish looks up the tag's ID number).  You wouldn't
>want that because then you couldn't have separate alias mappings for
>metanames and property names.
>
>It might be possible to have (yet another) config option that allows you
>to set the flags on tags.  Something like:
>
>   MetaNameAlias swishdefault title
>   StructureFlags title in_title in_head
>
>So that would emulate what happens when processing HTML.  Words in a
><title> tag get indexed as swishdefault metaname, plus those words are
>flagged as being title words (and in the <head> section, too).
>
>If I wanted that behavior today I'd use -S prog and write a Perl program to
>parse my XML and output HTML.
>
>Ok, time for another cup of coffee.....
>
>
>--
>Bill Moseley
>mailto:moseley@hank.org
Received on Wed Aug 28 16:11:19 2002