Re: Indexing cut off - more info

From: David VanHook <dvanhook(at)>
Date: Tue Apr 29 2003 - 14:38:11 GMT
Thanks very much for the quick reply.

The version we're running is SWISH-E 2.2rc1.

In looking at which files are getting indexed, and which ones aren't, it
appears that the titles for many documents are getting indexed, but not the
body text.

Which makes me wonder if the way I've got SwishCommand index and noindex set
up is causing the problem.  They're not balanced.  Here's an example:

<TITLE>title here</TITLE>
<!-- SwishCommand noindex -->
	Junk HTML up here -- navbar, etc.
	Blah blah blah
<!-- SwishCommand index -->
	Document body goes in here
	Good document body
<!-- SwishCommand noindex -->
more junk code
more junk code
<!-- SwishCommand noindex -->

When I do a search for word X (on these bad indexes), it appears that word X
is only showing up when it appears in the title of the document.

I've had it set up this way from the very beginning, as I recall, but maybe
I'm remembering wrong.  Is it possible that SWISH is "remembering" the
unmatched NOINDEX command from previous documents and is getting confused?
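
One way to check the markers themselves, independent of SWISH-E, is to list them in order for a single file. The sketch below is only an illustration: `marker_report` is a made-up helper, and the toggle model it uses (indexing starts on, `noindex` turns it off, `index` turns it back on) is an assumption, not a description of how SWISH-E actually parses these comments.

```python
import re

# Matches the two comment markers shown in the example above.
MARKER = re.compile(r"<!--\s*SwishCommand\s+(noindex|index)\s*-->", re.IGNORECASE)

def marker_report(html):
    """Return (markers in document order, indexing state at end of file).

    Assumed toggle model: indexing starts on, 'noindex' switches it off,
    'index' switches it back on. Real SWISH-E behaviour may differ.
    """
    markers = [m.group(1).lower() for m in MARKER.finditer(html)]
    indexing = True
    for name in markers:
        indexing = (name == "index")
    return markers, indexing

# The example document from this message.
sample = """<TITLE>title here</TITLE>
<!-- SwishCommand noindex -->
Junk HTML up here
<!-- SwishCommand index -->
Good document body
<!-- SwishCommand noindex -->
more junk code
<!-- SwishCommand noindex -->
"""

markers, still_indexing = marker_report(sample)
print(markers)
print("indexing on at end of file:", still_indexing)
```

Under that assumed model the example file ends with indexing switched off, which is the state that would have to carry over into the next document for the "remembering" theory to hold.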


Dave V.

-----Original Message-----
Sent: Tuesday, April 29, 2003 10:03 AM
To: Multiple recipients of list
Subject: [SWISH-E] Re: Indexing cut off - more info

On Tue, Apr 29, 2003 at 06:47:45AM -0700, David VanHook wrote:
> Here's a bit more information -- it appears that the logfiles for the
> "good" indexings and the logfiles for the "bad" indexings are different in
> one respect.
> The number of files they index is the same: 21,000 files.  But on the bad
> ones, the indexer is finding 26,041 unique words and 535,411 total words.
> On the good ones, the indexer is finding 108,563 unique words and
> 5,971,632 total words.
> So it's seeing the files, but not indexing them completely.  I've looked at
> the source code, and the SwishCommand noindex and SwishCommand index tags
> are in the proper spots.  And we've not made any edits to our stopwords
> since January.
> Any ideas which would cause the indexer to look at the files but not index
> them in this fashion?

Which version are you running?

Those are big differences in word counts so you should be able to find a
single document to test with.  If not, there's probably a way to find the
bad files with -T and counting the number of words per file.
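
To put a rough scale on those differences, the totals quoted above can be turned into per-file averages. This is only arithmetic on the numbers from the log files; the variable names are made up for illustration.

```python
# Figures taken from the quoted log summaries.
files = 21_000
bad_total = 535_411      # total words, "bad" index
good_total = 5_971_632   # total words, "good" index

bad_avg = bad_total / files
good_avg = good_total / files
ratio = good_total / bad_total

print(f"bad:   {bad_avg:.1f} words per file")
print(f"good:  {good_avg:.1f} words per file")
print(f"ratio: {ratio:.1f}x")
```

That works out to about 25 words per file in the bad index against about 284 in the good one, so a file whose count sits in the tens rather than the hundreds is a reasonable candidate to test with.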

Then I'd just look at the output from and see what's missing.  If
nothing is missing then feed that output into swish and use -T indexed_words
and make sure it's all getting indexed.
