Re: strange indexing order in swish-e 2.2 rc1

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Mon Sep 02 2002 - 22:31:02 GMT

At 03:07 PM 09/02/02 -0700, Trond Nilsen wrote:

>I've been playing about with Swish-E 2.2 rc1 a bit for the last day or two, 
>mostly using the HTTP method.

Thanks for testing!

BTW: -S prog with spider.pl will give better performance and more features.

>A simple case demonstrates the strangeness
>
>I'm indexing a site with three html files
>
>foo.html, containing links to bar.html and bas.html
>bar.html, containing links to foo.html and bas.html
>bas.html, containing links to foo.html and bar.html
>
>Pointing at foo.html for a start..
>
>1  D:\temp\swish22test>d:\progra~1\swish-e22\swish-e -c test.cfg -S http

Try adding -T properties and see that all three files are indexed.

You are assuming that the order of output is the order of processing.  I
think what's happening is that swish-e is telling you "Already indexed" when
it really means "I've already seen this URL and have either indexed it or
queued it up to index, so I don't need to add it to my URLs-to-index queue
again."


>6  retrieving http://localhost/foo.html (0)...
>7  Returned 0
>8   - Using DEFAULT (HTML) parser -  (4 words)
There's "foo.html" being indexed.

>9  retrieving http://localhost/bar.html (1)...
>10 Returned 0
>11  - Using DEFAULT (HTML) parser -  (4 words)
There's "bar.html" being indexed.


>12 Skipping http://localhost/foo.html:  Already indexed.
>13 Skipping http://localhost/bas.html:  Already indexed.

It's saying "Already indexed" for "bas.html" because it already knows about
"bas.html" from fetching the first document, "foo.html", so the link to
"bas.html" is skipped when it is read from "bar.html".  (Does that make sense?)


>14 retrieving http://localhost/bas.html (1)...
>15 Returned 0
>16  - Using DEFAULT (HTML) parser -  (4 words)
And finally, there's "bas.html" being indexed.
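
To make the ordering concrete, here is a rough sketch in Python of the kind of "seen" bookkeeping I'm describing.  This is not swish-e's actual code, just a guess at the general technique: a URL is marked as seen at the moment it is queued, not when it is fetched.  The LINKS table and the crawl() function are made up for illustration, and the messages only imitate the ones in your output.

```python
from collections import deque

# Hypothetical link graph standing in for the three test pages.
LINKS = {
    "foo.html": ["bar.html", "bas.html"],
    "bar.html": ["foo.html", "bas.html"],
    "bas.html": ["foo.html", "bar.html"],
}


def crawl(start, links):
    """Breadth-first crawl that marks a URL as seen when it is queued."""
    seen = {start}                # URLs already fetched OR already queued
    queue = deque([(start, 0)])   # (url, depth) pairs waiting to be fetched
    log = []
    while queue:
        url, depth = queue.popleft()
        log.append(f"retrieving {url} ({depth})...")
        for link in links[url]:
            if link in seen:
                # "Already indexed" here really means "already seen".
                log.append(f"Skipping {link}:  Already indexed.")
            else:
                seen.add(link)
                queue.append((link, depth + 1))
    return log


if __name__ == "__main__":
    print("\n".join(crawl("foo.html", LINKS)))
```

Starting from foo.html, this sketch reports bas.html as skipped while processing bar.html, before bas.html is actually retrieved, which is the same ordering you saw.  It also prints two more skip lines after bas.html is retrieved, for its links back to foo.html and bar.html.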



-- 
Bill Moseley
mailto:moseley@hank.org
Received on Mon Sep 2 22:34:30 2002