Re: Unexpected index file size reduction

From: Lauren <lauren(at)not-real.landsburg.net>
Date: Fri Sep 27 2002 - 12:34:33 GMT

Bill> Which index file?  index.swish-e or index.swish-e.prop?  If the .prop file
> is the one that is smaller that may be due to zlib compression.


They both suddenly got smaller, but the one I reported as dropping from 
35.9 to 25.5 meg was index.swish-e.

Bill> Jose is constantly working to compress the index file, so although I can't
> remember a specific change, it's possible you are seeing the results of his
> efforts.
> 

Jose> I have
> spent a lot of effort on this issue in the very last versions.
> How small it should be relies on the type of docs. Also, Bill
> added zlib support for properties.


I could use some simplifying translation here.  What is puzzling me is 
not that the size dropped _when_ we moved to swish-e 2.2, but that it 
dropped at a subsequent time, when we hadn't made any updates to our 
version.  We put 2.2 in place and used it for several months.  The index 
file grew gradually to 35.9 meg; then the next week it was much smaller 
with no evident change on _our_ end. I am convinced that no .html files 
are being skipped.

So my question is this: Jose: Is there something in your compression 
routines that could result in a decrease that large just by my _adding_ 
some files to be indexed?  I'm hoping for some illumination in the form 
of ideas about what could trigger such a fortuitous and dramatic result.
(For example: One simplistic theory is that you've got a compression 
routine that kicks in at 36 meg.  Another is that compression that dramatic 
could result from certain words being eliminated from the index because 
they've crossed some threshold.  [That latter theory seems unlikely to 
me: all I did was add about 20 pages to the website.])

Bill> Two hours seems like a long time to fetch 4000 files.  I suppose you have a
> delay to keep from hitting your server too hard.
> 
> If you use the spider.pl and the keep_alive feature then you should be able
> to spider much faster without much load on the server (depending on your
> available bandwidth, of course).


Thanks!  I'll check into these hints.

The estimate of 4000 files is low because I haven't quite been able to 
get the robot to quit indexing a lot of files that are generated 
on-the-fly by some .pl programs I've got in place, even when I use 
no-index and no-follow.  I'm not sure what I'm doing wrong, but the two 
hours is only a small annoyance in the scheme of things, so I never 
really worked on this matter.  There are some .pl files I _do_ want to 
index, and the other ones I filter out of the Results page seen by the 
user.

Although I believe I didn't change anything along these lines in the 
last round, I'm thinking that one theory of the smaller index file size 
could possibly be related to these .pl files.  Maybe something I did 
_did_ cause the .pl files to no longer be indexed.  I can certainly 
believe that that might cause a precipitous drop in the index file size.
If Jose has no immediate ideas, I'll do another run and look at the 
log file to see if this is the answer!
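
For that log check, something like the rough sketch below might do. It 
assumes I can first pull the indexed URLs out of each run's log into a 
plain list, one URL per entry; how to extract them depends on the log 
format, which I haven't looked at closely. The function names are just 
made up for illustration and are not part of swish-e. It tallies URLs by 
file extension for an old run and a new run and reports which extensions 
gained or lost entries, so a drop in .pl pages would stand out.

```python
from collections import Counter
from urllib.parse import urlparse
import posixpath


def count_by_extension(urls):
    """Tally URLs by the file extension of their path (query strings ignored)."""
    counts = Counter()
    for url in urls:
        url = url.strip()
        if not url:
            continue  # skip blank lines
        path = urlparse(url).path
        ext = posixpath.splitext(path)[1].lower() or "(none)"
        counts[ext] += 1
    return counts


def extension_changes(old_urls, new_urls):
    """Return {extension: new_count - old_count} for extensions whose count changed."""
    old = count_by_extension(old_urls)
    new = count_by_extension(new_urls)
    return {
        ext: new[ext] - old[ext]
        for ext in sorted(set(old) | set(new))
        if new[ext] != old[ext]
    }
```

A large negative number next to ".pl" would support the theory that the 
generated pages stopped being indexed; no change there would rule it out.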

Lauren
Received on Fri Sep 27 12:38:09 2002