On Thu, Oct 07, 2004 at 10:04:14PM -0700, Mark Greenaway wrote:
Sorry for the delay. It got really dark on this side of the planet
for a few hours.
> Further to previous post
Thanks for the simple setup to test. Show the commands you are
running too, and their output -- so I can follow exactly what you are
doing.
I tried it two ways, and both work fine.
First without spidering:
moseley@laptop:~$ cat c
MetaNames outputs organisation strategy domain mission hqcountry countries web email
PropertyNames outputs organisation strategy domain mission hqcountry countries web email
SwishProgParameters nacl.pl
IndexDir spider.pl
moseley@laptop:~$ cat t.html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD>
<meta name="organisation" content="Site4">
<meta name="strategy" content="research education">
<meta name="domain" content="government politics law">
<meta name="outputs" content="Papers Journals Newsletters Policy Research">
<meta name="countries" content="Australia">
<meta name="hqcountry" content="Australia">
<meta name="mission" content="To influence decision makers">
<meta name="web" content="http://www.site4.org.au">
<meta name="email" content="jim@site4.org.au">
<TITLE>Site4 - confusion reigns</TITLE>
</HEAD>
<BODY>
<H1>Site4 - NACL Matrix test site</H1>
<hr>
<a href="http://incres.anu.edu.au/nacl/index.html">link</a>
<hr>
</BODY>
</HTML>
moseley@laptop:~$ swish-e -c c -i t.html -v0 -T properties
swishdocpath: 6 ( 6) S: "t.html"
swishtitle: 7 ( 24) S: "Site4 - confusion reigns"
swishdocsize: 8 ( 4) N: "725"
swishlastmodified: 9 ( 4) D: "2004-10-08 05:33:21 PDT"
outputs:19 ( 43) S: "Papers Journals Newsletters Policy Research"
organisation:20 ( 5) S: "Site4"
strategy:21 ( 18) S: "research education"
domain:22 ( 23) S: "government politics law"
mission:23 ( 28) S: "To influence decision makers"
hqcountry:24 ( 9) S: "Australia"
countries:25 ( 9) S: "Australia"
web:26 ( 23) S: "http://www.site4.org.au"
email:27 ( 16) S: "jim@site4.org.au"
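To compare a run like the one above against your own, it may help to parse the property lines rather than eyeball them. This is only a minimal sketch: it assumes each line looks like `name: id ( len) T: "value"` as in the dump above, it handles the dump for a single document, and `parse_properties` is a hypothetical helper, not part of swish-e.

```python
import re

# Matches lines such as:  swishtitle: 7 ( 24) S: "Site4 - confusion reigns"
# The layout is assumed from the -T properties output shown above.
PROP_LINE = re.compile(
    r'^\s*(\w+)\s*:\s*(\d+)\s*\(\s*(\d+)\)\s*([A-Z]):\s*"(.*)"\s*$'
)

def parse_properties(dump):
    """Return {property name: value} for every property line in the dump."""
    props = {}
    for line in dump.splitlines():
        match = PROP_LINE.match(line)
        if match:
            name, _prop_id, _length, _type, value = match.groups()
            props[name] = value
    return props

# A few lines copied from the output above.
sample = '''swishdocpath: 6 ( 6) S: "t.html"
swishtitle: 7 ( 24) S: "Site4 - confusion reigns"
outputs:19 ( 43) S: "Papers Journals Newsletters Policy Research"
email:27 ( 16) S: "jim@site4.org.au"'''

props = parse_properties(sample)
print("swishtitle" in props, props.get("swishtitle"))
```

If `swishtitle` is missing from the resulting dictionary for your file, that would confirm the problem is at indexing time rather than in how the index is dumped.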
Ok, now spidering:
moseley@laptop:~$ swish-e -c c -S prog -v0 -T properties | grep swishtitle
/usr/local/lib/swish-e/spider.pl: Reading parameters from 'nacl.pl'
Summary for: http://incres.anu.edu.au/nacl/matrixorgs.html
Connection: Close: 2 (0.2/sec)
Off-site links: 2 (0.2/sec)
Total Bytes: 1,734 (133.4/sec)
Total Docs: 3 (0.2/sec)
Unique URLs: 3 (0.2/sec)
swishtitle: 7 ( 33) S: "List of NACL Matrix Organisations"
swishtitle: 7 ( 27) S: "Site1 NACL Matrix test site"
swishtitle: 7 ( 24) S: "Site4 - confusion reigns"
Then, to see what's in the index (swish prints things twice when
dumping the index; IIRC it accesses the properties using two
different methods):
moseley@laptop:~$ swish-e -T index_files | grep swishtitle
IgnoreTotalWordCountWhenRanking must be 0 to use IDF ranking
IgnoreTotalWordCountWhenRanking must be 0 to use IDF ranking
IgnoreTotalWordCountWhenRanking must be 0 to use IDF ranking
swishtitle: 7 ( 33) S: "List of NACL Matrix Organisations"
swishtitle: 7 ( 33) S: "List of NACL Matrix Organisations"
swishtitle: 7 ( 27) S: "Site1 NACL Matrix test site"
swishtitle: 7 ( 27) S: "Site1 NACL Matrix test site"
swishtitle: 7 ( 24) S: "Site4 - confusion reigns"
swishtitle: 7 ( 24) S: "Site4 - confusion reigns"
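Because the index dump prints each property twice, counting `swishtitle` lines overstates the number of documents. A rough sketch for getting one title per document from such a dump follows. It assumes every title is printed exactly twice on consecutive matching lines, as in the output above, and `unique_titles` is a made-up name.

```python
import re

# Only swishtitle property lines are of interest; warnings are skipped.
TITLE_LINE = re.compile(
    r'^\s*swishtitle\s*:\s*\d+\s*\(\s*\d+\)\s*S:\s*"(.*)"\s*$'
)

def unique_titles(dump):
    """Return one title per document from a dump that prints each twice."""
    titles = []
    for line in dump.splitlines():
        match = TITLE_LINE.match(line)
        if match:
            titles.append(match.group(1))
    firsts, seconds = titles[0::2], titles[1::2]
    # Refuse to guess if the output is not in identical pairs.
    if firsts != seconds:
        raise ValueError("titles are not printed in identical pairs")
    return firsts

sample = '''IgnoreTotalWordCountWhenRanking must be 0 to use IDF ranking
swishtitle: 7 ( 33) S: "List of NACL Matrix Organisations"
swishtitle: 7 ( 33) S: "List of NACL Matrix Organisations"
swishtitle: 7 ( 27) S: "Site1 NACL Matrix test site"
swishtitle: 7 ( 27) S: "Site1 NACL Matrix test site"
swishtitle: 7 ( 24) S: "Site4 - confusion reigns"
swishtitle: 7 ( 24) S: "Site4 - confusion reigns"'''

for title in unique_titles(sample):
    print(title)
```

Pairing the lines instead of simply dropping repeats keeps two different documents that happen to share a title from being merged into one.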
(Hey, Peter -- what's that IDF warning about?)
> Even this tiny example when run has no swishtitle
> If you remove one of the metatags from site4.html then swishtitle shows up
So, what am I doing differently?
Are you, by chance, forgetting to specify the index file when dumping
the index?
--
Bill Moseley
moseley@hank.org
Received on Fri Oct 8 05:49:37 2004