Here's an interesting experiment. I'm running Swish-E on two machines that
are side-by-side on my desk. Both have the latest versions of Swish-E from
the download page for their respective OS (downloaded and installed on both
in the last five days). Here are the hardware specs and what the "Swish-E
-V" (version label) command returns for both of them:
Machine A: Athlon 750 MHz, 224MB RAM running Mandrake Linux 8.1 ("SWISH-E
Machine B: Athlon 1 GHz, 384MB RAM running Windows XP Home ("SWISH-E
These two machines share the same internet connection (both connected to the
same gateway -- not one sharing the other's connection), and both present
comparable performance when surfing the web and downloading files.
Okay, they both have Apache, so I set them both loose on the Apache manual
via the file system. Here's how long it took:
Machine A (Linux / Swish 2.0): 3 seconds
Machine B (Windows / Swish 2.1-dev-24): 3 seconds
Perfectly comparable...when indexing via the file system.
Here's where it gets interesting: I set up the swishspider and unleashed
them both on the same web site (very small -- just 19 unique pages) via HTTP
crawl at the same general time (one just after another, late at night when
volume was low; web server logs indicate that the spider was the only active
session on the web site at the time).
The time differences were massive:
Machine A (Linux / Swish 2.0): 21 minutes (that's MINUTES, not seconds...)
Machine B (Windows / Swish 2.1-dev-24): 14 seconds
This is not a fluke -- I did the same test several times and got the same
results.
The test is also informally mirrored. I have Swish-E running at work on
Windows 2000 Professional, and a friend has it running on Mandrake Linux
8.1, both with the same version numbers (Windows at 2.1 dev, Linux at 2.0).
Performance in both instances is representative of the respective times
above.
So, where does the difference come from? It has to be something to do with
the spider since they have the same performance indexing via the file
system. Is it:
(1) A difference in the versions? I know that spidering and indexing time
was improved in the new release, but improved THAT much? Wow.
(2) A difference in the underlying operating systems? Could Windows and
Linux handle HTTP requests and HTML parsing THAT differently?
I researched this on the discussion group and found this post:
This indicates that the system will page at the tail end of the crawl when
it says "Writing index entries...". However, that's not the problem here.
The Linux machine is just slow from page to page when indexing. The output
says something like, "Retrieving page http://blah.blah..." and it just
sits...and sits...and sits...and then moves on.
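One way to narrow this down outside of Swish-E would be to time the separate phases of a single page request on each machine. The following is only a rough Python sketch, not anything taken from Swish-E or swishspider: the function name time_phases and the split into DNS lookup, TCP connect, and fetch are my own assumptions, and the URLs are whatever you pass on the command line. If one phase accounts for nearly all of the wait on the Linux box, that at least points at where to look next.

```python
import socket
import sys
import time
from urllib.parse import urlsplit


def time_phases(url, timeout=30.0):
    """Time DNS lookup, TCP connect, and a plain HTTP/1.0 GET for one URL.

    Hypothetical helper for diagnosis only; handles http:// URLs, no TLS.
    """
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 80
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query

    # Phase 1: name resolution.
    t0 = time.perf_counter()
    family, socktype, proto, _, addr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]
    t1 = time.perf_counter()

    sock = socket.socket(family, socktype, proto)
    sock.settimeout(timeout)
    try:
        # Phase 2: TCP connection setup.
        sock.connect(addr)
        t2 = time.perf_counter()

        # Phase 3: send the request and read until the server closes.
        request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
        sock.sendall(request.encode("ascii"))
        received = 0
        while True:
            chunk = sock.recv(65536)
            if not chunk:
                break
            received += len(chunk)
        t3 = time.perf_counter()
    finally:
        sock.close()

    return {
        "dns": t1 - t0,
        "connect": t2 - t1,
        "fetch": t3 - t2,
        "bytes": received,
    }


if __name__ == "__main__":
    # Usage: python time_phases.py http://host/page1 http://host/page2 ...
    for page in sys.argv[1:]:
        r = time_phases(page)
        print(
            f"{page}: dns={r['dns']:.3f}s connect={r['connect']:.3f}s "
            f"fetch={r['fetch']:.3f}s bytes={r['bytes']}"
        )
```

Running this against a few of the 19 pages from both machines, back to back, would show whether the raw HTTP round trip itself is slow on Linux or whether the delay only appears inside the spider.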
The Sling and Rock Design Group
Received on Fri Jan 11 20:12:44 2002