

Re: Big indexes

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Jul 05 2002 - 18:16:47 GMT
Here's an update on indexing a large number of files.

Jose added a patch that vastly improves indexing speed when using the -e
option with a large number of files.  377 temporary files are used while
indexing.  I still need to test with different numbers of temp files; on
some file systems that may be too many files open at one time.
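Whether 377 simultaneous temp files fits under a given system's descriptor limit is easy to check up front.  A minimal sketch (Python used for illustration, assuming a POSIX system; the "+10" allowance is a guess, not anything Swish-e does):

```python
import resource

# Soft and hard limits on open file descriptors (POSIX RLIMIT_NOFILE)
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# 377 temp files, plus a rough allowance for stdin/stdout/stderr,
# the index files, and anything else the process holds open
needed = 377 + 10

if needed > soft:
    print(f"May run out of descriptors: need ~{needed}, soft limit is {soft}")
else:
    print(f"OK: soft limit {soft} covers ~{needed} descriptors")
```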

There are still limitations, but this should help those using -e now.

Here are some examples (commas added for readability):
Linux, Athlon XP 1800+, 512MB RAM, single IDE drive

800,000 files indexed.  1,571,803,843 total bytes.  162,496,911 total words.
Elapsed time: 00:35:22 CPU time: 00:23:36

2,000,000 files indexed.  3,929,488,642 total bytes.  406,262,136 total words.
Elapsed time: 01:28:55 CPU time: 00:-10:-50

Not too bad.  I see we have a bug when the CPU time is over 60 minutes.
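The negative fields above look like a formatting problem once CPU time passes an hour.  Formatting a duration with repeated divmod on the total seconds avoids that kind of sign error; a minimal sketch (not Swish-e's actual code):

```python
def hms(total_seconds):
    """Format a non-negative duration in seconds as HH:MM:SS."""
    minutes, seconds = divmod(int(total_seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"
```

For example, hms(4250) gives "01:10:50" rather than anything with negative components.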

Memory used while parsing the files never exceeded about 15MB.  But sorting
the properties did use quite a bit of RAM (about 200MB) since all the
properties are loaded in RAM at the same time for sorting.  
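That in-memory sort is what drives the ~200MB peak.  An external merge sort (write sorted runs to temp files, then merge them with a heap) would keep memory bounded by the run size instead of the total data size.  A minimal sketch of the idea, not Swish-e's actual property-sorting code:

```python
import heapq
import tempfile

def _write_run(sorted_lines):
    """Write one sorted run to a temp file and return its path."""
    f = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
    f.write("\n".join(sorted_lines) + "\n")
    f.close()
    return f.name

def external_sort(items, run_size=1000):
    """Sort an iterable of strings with bounded memory: buffer at most
    run_size items, flush each buffer as a sorted run on disk, then
    lazily merge the runs with heapq.merge."""
    runs, buf = [], []
    for item in items:
        buf.append(item)
        if len(buf) >= run_size:
            runs.append(_write_run(sorted(buf)))
            buf = []
    if buf:
        runs.append(_write_run(sorted(buf)))
    files = [open(r) for r in runs]
    try:
        # Only one pending line per run is held in RAM during the merge.
        for line in heapq.merge(*files):
            yield line.rstrip("\n")
    finally:
        for f in files:
            f.close()
```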

Here's the index size when done:

-rw-r--r--    1 moseley  moseley  2,093,245,767 Jul  5 10:39 index.swish-e
-rw-r--r--    1 moseley  moseley  143,587,414 Jul  5 10:38 index.swish-e.prop

Just about at my 2GB file limit.

With -e, parsing speed (files/second) stays consistent:

File 5000  466.15/second over 5000 records.
File 10000  462.43/second over 5000 records.
File 15000  444.95/second over 5000 records.
..
..
File 1995000  435.81/second over 5000 records.
File 2000000  458.03/second over 5000 records.
File 2000000  448.51/second over 2000000 records.

I'm using a program to create the files to index.  Each file contains
words picked at random from a dictionary file.  It can generate about
2000 files/second when writing to /dev/null.
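The generator itself (prog.pl) isn't shown; a rough equivalent that writes files of words drawn at random from a dictionary might look like this (Python sketch, with made-up file names and a made-up words-per-file parameter):

```python
import random

def make_test_files(dictionary_path, count, words_per_file=200, out="."):
    """Write `count` pseudo-documents, each containing `words_per_file`
    words picked at random from the given dictionary word list."""
    with open(dictionary_path) as f:
        words = [w.strip() for w in f if w.strip()]
    for i in range(count):
        text = " ".join(random.choice(words) for _ in range(words_per_file))
        with open(f"{out}/doc{i:07d}.txt", "w") as doc:
            doc.write(text + "\n")
```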

Without -e, parsing slows down over time:

[./prog.pl] Setting file count = 2000000
[./prog.pl] Send a 'kill -hup 2113' to abort
File 5000  573.38/second over 5000 records.
File 10000  566.01/second over 5000 records.
File 15000  552.98/second over 5000 records.
File 20000  549.03/second over 5000 records.
File 25000  529.62/second over 5000 records.
File 30000  529.52/second over 5000 records.
File 35000  508.22/second over 5000 records.
File 40000  513.50/second over 5000 records.
File 45000  490.33/second over 5000 records.
File 50000  498.09/second over 5000 records.
File 55000  474.15/second over 5000 records.
File 60000  483.36/second over 5000 records.
File 65000  458.55/second over 5000 records.
File 70000  469.76/second over 5000 records.
File 75000  444.29/second over 5000 records.
File 80000  456.70/second over 5000 records.
File 85000  429.82/second over 5000 records.
File 90000  445.07/second over 5000 records.
File 95000  418.10/second over 5000 records.




-- 
Bill Moseley
mailto:moseley@hank.org
Received on Fri Jul 5 18:20:25 2002