I've been playing about with Swish-E 2.2 rc1 a bit for the last day or two,
mostly using the HTTP method.
For part of this time, I was convinced that it was arbitrarily labelling files
as 'already indexed' during spidering.
A simple case demonstrates the strangeness.
I'm indexing a site with three HTML files:
foo.html, containing links to bar.html and bas.html
bar.html, containing links to foo.html and bas.html
bas.html, containing links to foo.html and bar.html
Pointing swish at foo.html to start with:
1 D:\temp\swish22test>d:\progra~1\swish-e22\swish-e -c test.cfg -S http
3 Indexing Data Source: "HTTP-Crawler"
4 Indexing "http://localhost/foo.html"
5 Returned 0
6 retrieving http://localhost/foo.html (0)...
7 Returned 0
8 - Using DEFAULT (HTML) parser - (4 words)
9 retrieving http://localhost/bar.html (1)...
10 Returned 0
11 - Using DEFAULT (HTML) parser - (4 words)
12 Skipping http://localhost/foo.html: Already indexed.
13 Skipping http://localhost/bas.html: Already indexed.
14 retrieving http://localhost/bas.html (1)...
15 Returned 0
16 - Using DEFAULT (HTML) parser - (4 words)
17 Skipping http://localhost/foo.html: Already indexed.
18 Skipping http://localhost/bar.html: Already indexed.
<snip rest of output>
Note that by the time swish has reached line 13, it has indexed 'foo.html'
and 'bar.html', but not 'bas.html'. However, it incorrectly reports
'bas.html' as 'Already indexed', and only retrieves it on the next line.
I presume that what's happening is that swish is building up a list of pages
to index, and delivering the 'Already indexed' message based on whether a page
exists within that list, when it should really be just ignoring duplicates.
'Already indexed' should be reserved for files that have indeed been indexed.
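
To make that guess concrete, here is a minimal Python sketch of the behaviour
I would expect. It is not Swish-E's actual code: the names SITE, crawl, queued
and indexed are invented for illustration, the three-page site is held in
memory rather than fetched over HTTP, and the 'Already queued' message is my
own suggestion. It keeps pages that are merely waiting in the list separate
from pages that have really been indexed.

```python
from collections import deque

# Hypothetical in-memory stand-in for the three-page test site.
SITE = {
    "foo.html": ["bar.html", "bas.html"],
    "bar.html": ["foo.html", "bas.html"],
    "bas.html": ["foo.html", "bar.html"],
}


def crawl(start, links):
    """Breadth-first crawl that tracks queued and indexed pages separately."""
    queued = {start}          # every page that has ever been put on the list
    indexed = set()           # pages that have actually been retrieved
    pending = deque([start])  # pages still waiting to be retrieved
    log = []

    while pending:
        page = pending.popleft()
        log.append(f"retrieving {page}")
        indexed.add(page)

        for target in links[page]:
            if target in indexed:
                # Only pages that were really processed get this message.
                log.append(f"Skipping {target}: Already indexed.")
            elif target in queued:
                # Duplicate link to a page that is still waiting its turn.
                log.append(f"Skipping {target}: Already queued.")
            else:
                queued.add(target)
                pending.append(target)

    return log


if __name__ == "__main__":
    for line in crawl("foo.html", SITE):
        print(line)
```

Run against the same three pages, this visits them in the same order as the
output above, but the link to bas.html found in bar.html is reported as
queued rather than indexed. Silently ignoring the duplicate would do just as
well.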
It took me a while to isolate this - indexing a real site, I got large
quantities of 'Already indexed' messages which hid what was going on, and
confused the hell out of me.
Is this a known problem? Or some sort of idiosyncrasy I've managed to
introduce in my config? I've got the test site and my config file here if
anyone wants to take a look. Also, I'm using the Swish-E 2.2 rc1 windows
If I have time this afternoon, and I've not heard anything else, I might go
bug hunting myself..
Trond Nilsen Alchemy Group
Software Engineer http://www.alchemy.co.nz
Received on Mon Sep 2 22:11:42 2002