At 03:07 PM 09/02/02 -0700, Trond Nilsen wrote:
>I've been playing about with Swish-E 2.2 rc1 a bit for the last day or two,
>mostly using the HTTP method.
Thanks for testing!
BTW: -S prog with spider.pl will give better performance and more features.
>A simple case demonstrates the strangeness
>I'm indexing a site with three html files
>foo.html, containing links to bar.html and bas.html
>bar.html, containing links to foo.html and bas.html
>bas.html, containing links to foo.html and bar.html
>Pointing at foo.html for a start..
>1 D:\temp\swish22test>d:\progra~1\swish-e22\swish-e -c test.cfg -S http
Try adding -T properties and see that all three files are indexed.
You are assuming that the order of output is the order of processing. I
think what's happening is that swish-e is telling you "Already indexed" when
it really means "I've already seen this URL and have either indexed it or
queued it up to index, so I don't need to add it to my URLs-to-index queue."
>6 retrieving http://localhost/foo.html (0)...
>7 Returned 0
>8 - Using DEFAULT (HTML) parser - (4 words)
there's "foo.html" being indexed
>9 retrieving http://localhost/bar.html (1)...
>10 Returned 0
>11 - Using DEFAULT (HTML) parser - (4 words)
there's "bar.html" being indexed
>12 Skipping http://localhost/foo.html: Already indexed.
>13 Skipping http://localhost/bas.html: Already indexed.
It's saying bas.html is "Already indexed" because it knows about "bas.html"
from fetching the first document "foo.html", so the link to "bas.html" is
skipped when read from "bar.html". (Does that make any sense?)
>14 retrieving http://localhost/bas.html (1)...
>15 Returned 0
>16 - Using DEFAULT (HTML) parser - (4 words)
and finally there's "bas.html" being indexed.
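
If it helps, here is a rough Python sketch of the bookkeeping I'm describing.
It is not swish-e's actual code, just my guess at the general shape: the
crawl() function, the "seen" set, and the in-memory "links" table standing in
for your three pages are all made up for illustration. The point is that a URL
goes into "seen" at the moment it is queued, not when it is finally fetched,
so a later link to it gets the "Already indexed" message even though the
fetch hasn't happened yet.

```python
from collections import deque

# Hypothetical stand-in for the three test pages and the links they contain.
links = {
    "foo.html": ["bar.html", "bas.html"],
    "bar.html": ["foo.html", "bas.html"],
    "bas.html": ["foo.html", "bar.html"],
}

def crawl(start, links):
    """Queue-driven crawl that marks a URL as seen when it is queued."""
    seen = {start}               # URLs already fetched OR waiting in the queue
    queue = deque([(start, 0)])  # (url, depth) pairs still to fetch
    log = []
    while queue:
        url, depth = queue.popleft()
        log.append(f"retrieving {url} ({depth})...")
        for target in links.get(url, []):
            if target in seen:
                # Seen does not mean fetched; it may only be queued so far.
                log.append(f"Skipping {target}: Already indexed.")
            else:
                seen.add(target)
                queue.append((target, depth + 1))
    return log

log = crawl("foo.html", links)
for line in log:
    print(line)
```

In this sketch the "Skipping bas.html" line comes out before the "retrieving
bas.html" line, which is the same ordering as in your output, and all three
pages still get fetched.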
Received on Mon Sep 2 22:34:30 2002