Re: MetaNamesRank working?

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Fri Jul 07 2006 - 15:24:53 GMT
Robert Stoeber scribbled on 7/7/06 9:03 AM:

> I tried to set this up in the config file like so:
> 
> MetaNamesRank 10 company
> MetaNamesRank  6 swishtitle
> 
> and the actual documents have headers like this:
> 
> <html>
> <head>
> <title>$headline</title>
> <meta name="company" content="$companyname">
> <meta name="pubdate" content="$pubdate">
> <meta name="cid" content="$companyid">
> </head>
> 
> So the question is, should this work?  Do I need higher numbers in the
> MetaNamesRank options?  Am I missing some other config option?
> 


The problem is that the HTML parser indexes each word under one metaname 
only, so a plain search for 'companyname' matches only if companyname is 
in the body of the doc. A search for 'companyname or company=companyname' 
will work more like you expect.

This is, IMHO, one of the shortcomings of the current HTML indexing 
scheme. It ought to be able to find 'companyname' no matter which 
metaname it is indexed under, then provide the more limiting feature of 
company=companyname.

Here's an example of what I'm talking about:

[karpet@cartermac:~/tmp/meta]$ cat c
MetaNamesRank 10 company
MetaNamesRank  6 swishtitle
IndexOnly .html
IgnoreTotalWordCountWhenRanking 0

[karpet@cartermac:~/tmp/meta]$ swish-e -w foo
# SWISH format: 2.5.4
# Search words: foo
# Removed stopwords:
# Number of hits: 2
# Search time: 0.006 seconds
# Run time: 0.030 seconds
1000 ./foo.html "my headline" 122
1000 ./bar.html "my headline" 122
.

[karpet@cartermac:~/tmp/meta]$ swish-e -w foo or company=foo
# SWISH format: 2.5.4
# Search words: foo or company=foo
# Removed stopwords:
# Number of hits: 2
# Search time: 0.004 seconds
# Run time: 0.030 seconds
1000 ./foo.html "my headline" 122
101 ./bar.html "my headline" 122
.

[karpet@cartermac:~/tmp/meta]$ cat foo.html
<html>
<head>
<title>my headline</title>
<meta name="company" content="foo co" />
</head>
<body>
  foo bar
</body>
</html>

[karpet@cartermac:~/tmp/meta]$ cat bar.html
<html>
<head>
<title>my headline</title>
<meta name="company" content="bar co" />
</head>
<body>
  foo bar
</body>
</html>



You would expect that a search for 'foo' would turn up foo.html with a 
much higher rank based on its meta tag content, but that only works if 
the query itself includes 'or company=foo'.

So you can filter your queries before handing them to swish-e, so that 
they always include the 'or metaname=xxxx' clause.
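
A minimal sketch of such a query filter, in Python. The function name 
expand_query and its metanames parameter are made up for illustration; 
they are not part of swish-e. It assumes the user's query is plain search 
words, and it wraps them in parentheses so that a multi-word query stays 
grouped under each metaname.

```python
def expand_query(query, metanames=("company",)):
    """Return the query OR'ed with the same words under each metaname.

    Hypothetical helper: swish-e itself only ever sees the returned string.
    """
    query = query.strip()
    if not query:
        # Nothing to expand; let the caller decide how to handle empty input.
        return query
    # Plain search first, then one metaname-limited clause per metaname.
    clauses = [f"({query})"]
    clauses += [f"{name}=({query})" for name in metanames]
    return " or ".join(clauses)


if __name__ == "__main__":
    print(expand_query("foo"))
```

The resulting string would then be passed to swish-e as the -w argument 
in place of the raw user query.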

Or you might experiment with indexing HTML docs with the XML parser, as 
that has more options for dealing with metanames -- specifically you can 
use MetaNameAlias to get the expected behaviour:

[karpet@cartermac:~/tmp/meta]$ swish-e -w foo
# SWISH format: 2.5.4
# Search words: foo
# Removed stopwords:
# Number of hits: 2
# Search time: 0.005 seconds
# Run time: 0.029 seconds
1000 ./foo.html "foo.html" 107
633 ./bar.html "bar.html" 107
.

[karpet@cartermac:~/tmp/meta]$ cat c
MetaNames swishtitle
MetaNameAlias swishdefault html
MetaNameAlias swishtitle title
MetaNamesRank 10 company
MetaNamesRank  6 swishtitle
IndexOnly .html
DefaultContents XML*
IgnoreTotalWordCountWhenRanking 0

[karpet@cartermac:~/tmp/meta]$ cat bar.html
<html>
<head>
<title>my headline</title>
<company>bar co</company>
</head>
<body>
  foo bar
</body>
</html>

[karpet@cartermac:~/tmp/meta]$ cat foo.html
<html>
<head>
<title>my headline</title>
<company>foo co</company>
</head>
<body>
  foo bar
</body>
</html>



NOTE that you have to filter your HTML so that it uses "real" tags 
instead of the HTML <meta> tag. Other caveats include the real 
possibility that your HTML is not XHTML compliant and will cause lots of 
parsing errors from the XML parser. If you go that route, I suggest 
running your HTML through a multi-stage filter that includes htmltidy.
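
As a rough illustration of the <meta>-to-"real"-tag stage of such a 
filter, here is a Python sketch. The name meta_to_tags is hypothetical, 
and the regex is deliberately narrow: it assumes double-quoted attributes 
in the order name then content, as in the example docs above. Real-world 
HTML would need a proper parser, and tidying would still be a separate 
stage.

```python
import re
from html import escape, unescape

# Matches <meta name="X" content="Y"> and <meta name="X" content="Y" />.
# Assumption: name comes before content and both use double quotes.
META_RE = re.compile(
    r'<meta\s+name="(?P<name>[A-Za-z_][\w.-]*)"\s+content="(?P<content>[^"]*)"\s*/?>',
    re.IGNORECASE,
)


def meta_to_tags(html_text):
    """Rewrite matching <meta> tags as <name>content</name> elements."""

    def repl(match):
        name = match.group("name")
        # Normalise entities so the element text is valid XML character data.
        content = escape(unescape(match.group("content")), quote=False)
        return f"<{name}>{content}</{name}>"

    return META_RE.sub(repl, html_text)


if __name__ == "__main__":
    print(meta_to_tags('<meta name="company" content="foo co" />'))
```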


> Question 2) There are many company names that contain a hyphen, such as
> Alpha-Medical Systems, or A-Top Company.  If we include the hyphen in
> WordCharacters a search on "A-Top" is found correctly, but Alpha is not
> found as a separate word.  If we remove the hyphen from WordCharacters
> we can find Alpha, but not A-Top.
> 
> I've tried creating two indexes, one with a hyphen and one without, and
> then merging them.  But I got an error about different index types
> (can't remember the exact phrase).  Is there any other workaround that
> would treat hyphenated words as both a single word and individual words?

Make sure you don't have MinWordLimit set higher than 1. IIRC it 
defaults to 1.

Here's an example. I would suggest NOT including - as a word character; 
that should allow you to match both Alpha-Medical and Alpha and Medical.

[karpet@cartermac:~/tmp/hyphen]$ cat t.html
<html>
foo-bar
f-bar
-bar
f-
</html>

[karpet@cartermac:~/tmp/hyphen]$ cat c
WordCharacters  0123456789abcdefghijklmnopqrstuvwxyz-
BeginCharacters 0123456789abcdefghijklmnopqrstuvwxyz-
EndCharacters   0123456789abcdefghijklmnopqrstuvwxyz-
[karpet@cartermac:~/tmp/hyphen]$ swish-e -c c -i t.html  -T indexed_words
Indexing Data Source: "File-System"
Indexing "t.html"
     Adding:[1:swishdefault(1)]   'foo-bar'   Pos:5  Stuct:0x9 ( BODY FILE )
     Adding:[1:swishdefault(1)]   'f-bar'   Pos:6  Stuct:0x9 ( BODY FILE )
     Adding:[1:swishdefault(1)]   '-bar'   Pos:7  Stuct:0x9 ( BODY FILE )
     Adding:[1:swishdefault(1)]   'f-'   Pos:8  Stuct:0x9 ( BODY FILE )
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 4 words alphabetically
Writing header ...
Writing index entries ...
   Writing word text: Complete
   Writing word hash: Complete
   Writing word data: Complete
4 unique words indexed.
4 properties sorted.
1 file indexed.  37 total bytes.  4 total words.
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!
[karpet@cartermac:~/tmp/hyphen]$ swish-e -w foo-bar
# SWISH format: 2.5.4
# Search words: foo-bar
# Removed stopwords:
# Number of hits: 1
# Search time: 0.005 seconds
# Run time: 0.034 seconds
1000 t.html "t.html" 37
.
[karpet@cartermac:~/tmp/hyphen]$ swish-e -w f-bar
# SWISH format: 2.5.4
# Search words: f-bar
# Removed stopwords:
# Number of hits: 1
# Search time: 0.009 seconds
# Run time: 0.034 seconds
1000 t.html "t.html" 37
.
[karpet@cartermac:~/tmp/hyphen]$ swish-e -w bar
# SWISH format: 2.5.4
# Search words: bar
# Removed stopwords:
err: no results
.
[karpet@cartermac:~/tmp/hyphen]$ swish-e -w foo
# SWISH format: 2.5.4
# Search words: foo
# Removed stopwords:
err: no results
.


And again with no - in WordCharacters (default config, no -c option):

[karpet@cartermac:~/tmp/hyphen]$ swish-e  -i t.html  -T indexed_words
Indexing Data Source: "File-System"
Indexing "t.html"
     Adding:[1:swishdefault(1)]   'foo'   Pos:5  Stuct:0x9 ( BODY FILE )
     Adding:[1:swishdefault(1)]   'bar'   Pos:6  Stuct:0x9 ( BODY FILE )
     Adding:[1:swishdefault(1)]   'f'   Pos:7  Stuct:0x9 ( BODY FILE )
     Adding:[1:swishdefault(1)]   'bar'   Pos:8  Stuct:0x9 ( BODY FILE )
     Adding:[1:swishdefault(1)]   'bar'   Pos:9  Stuct:0x9 ( BODY FILE )
     Adding:[1:swishdefault(1)]   'f'   Pos:10  Stuct:0x9 ( BODY FILE )
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 3 words alphabetically
Writing header ...
Writing index entries ...
   Writing word text: Complete
   Writing word hash: Complete
   Writing word data: Complete
3 unique words indexed.
4 properties sorted.
1 file indexed.  37 total bytes.  6 total words.
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!
[karpet@cartermac:~/tmp/hyphen]$ swish-e -w foo
# SWISH format: 2.5.4
# Search words: foo
# Removed stopwords:
# Number of hits: 1
# Search time: 0.004 seconds
# Run time: 0.032 seconds
1000 t.html "t.html" 37
.
[karpet@cartermac:~/tmp/hyphen]$ swish-e -w bar
# SWISH format: 2.5.4
# Search words: bar
# Removed stopwords:
# Number of hits: 1
# Search time: 0.004 seconds
# Run time: 0.042 seconds
1000 t.html "t.html" 37
.
[karpet@cartermac:~/tmp/hyphen]$ swish-e -w f-bar
# SWISH format: 2.5.4
# Search words: f-bar
# Removed stopwords:
# Number of hits: 1
# Search time: 0.004 seconds
# Run time: 0.032 seconds
1000 t.html "t.html" 37
.
[karpet@cartermac:~/tmp/hyphen]$ swish-e -w foo-bar
# SWISH format: 2.5.4
# Search words: foo-bar
# Removed stopwords:
# Number of hits: 1
# Search time: 0.004 seconds
# Run time: 0.032 seconds
1000 t.html "t.html" 37
.
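
The difference between the two runs can be mimicked with a toy tokenizer 
in Python. This is only a model of the idea, not swish-e's actual word 
splitting: the tokenize function is made up for illustration, it ignores 
BeginCharacters and EndCharacters, and it simply treats every maximal run 
of "word characters" as one word.

```python
import re

ALNUM = "0123456789abcdefghijklmnopqrstuvwxyz"


def tokenize(text, word_chars):
    """Split text into maximal runs of characters from word_chars."""
    pattern = "[" + re.escape(word_chars) + "]+"
    return re.findall(pattern, text.lower())


if __name__ == "__main__":
    doc = "foo-bar\nf-bar\n-bar\nf-\n"
    # With - as a word character, hyphenated strings stay whole.
    print(tokenize(doc, ALNUM + "-"))
    # Without it, the hyphen acts as a separator.
    print(tokenize(doc, ALNUM))
    # A hyphenated query is split the same way.
    print(tokenize("f-bar", ALNUM))
```

In this model a query such as f-bar is itself split into f and bar when - 
is not a word character, which is consistent with the hyphenated searches 
still finding t.html in the second run.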

-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Fri Jul 7 08:24:55 2006