Here's an update on indexing a large number of files.
Jose added a patch that vastly improves indexing speed when using the -e
option with a large number of files. 377 temporary files are used while
indexing. I still need to test with different numbers of temp files. I
assume there are some file systems where that would be too many open
files at a given time.
There are still limitations, but this should help those using -e now.
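On the open-files concern: on Unix-like systems the cap on simultaneously open files is usually a per-process limit, so one way to check ahead of time is to compare the temp file count against that limit. This is only a rough sketch in Python, not anything from the swish-e source. The names fits and can_open and the reserve of 16 descriptors are made up for illustration.

```python
import resource  # Unix-only standard library module

TEMP_FILES = 377  # temp file count mentioned in this post


def fits(n_files, soft_limit, reserve=16):
    """True if n_files plus a small reserve fits under soft_limit.

    reserve is an assumed allowance for stdin/stdout/stderr, the index
    files, and so on; it is not a swish-e setting.
    """
    return n_files + reserve <= soft_limit


def can_open(n_files, reserve=16):
    """Check n_files against this process's soft RLIMIT_NOFILE."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return True
    return fits(n_files, soft, reserve)


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("soft limit:", soft, "hard limit:", hard)
    print("room for", TEMP_FILES, "temp files:", can_open(TEMP_FILES))
```

With a soft limit of 1024, 377 temp files fit comfortably; with a limit of 256 they would not.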
Here are some examples (commas added):
Linux, running on an Athlon XP 1800+, 512MB RAM, single IDE drive
800,000 files indexed. 1,571,803,843 total bytes. 162,496,911 total words.
Elapsed time: 00:35:22 CPU time: 00:23:36
2,000,000 files indexed. 3,929,488,642 total bytes. 406,262,136 total words.
Elapsed time: 01:28:55 CPU time: 00:-10:-50
Not too bad. I see we have a bug when the CPU time is over 60 minutes.
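For reference, here is a minimal sketch of turning a seconds count into HH:MM:SS so that values over an hour come out right. It is written in Python for illustration and is not the swish-e code; I am not assuming anything about what causes the negative fields above. The name format_hms is made up.

```python
def format_hms(total_seconds):
    """Format a non-negative number of seconds as HH:MM:SS."""
    total_seconds = int(total_seconds)
    if total_seconds < 0:
        raise ValueError("seconds must be non-negative")
    hours, rem = divmod(total_seconds, 3600)   # whole hours first
    minutes, seconds = divmod(rem, 60)         # then minutes and seconds
    return "%02d:%02d:%02d" % (hours, minutes, seconds)


if __name__ == "__main__":
    print(format_hms(35 * 60 + 22))             # 00:35:22
    print(format_hms(1 * 3600 + 28 * 60 + 55))  # 01:28:55
```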
Memory used while parsing the files never exceeded about 15MB. But sorting
the properties did use quite a bit of RAM (about 200MB) since all the
properties are loaded in RAM at the same time for sorting.
Here's the index size when done:
-rw-r--r-- 1 moseley moseley 2,093,245,767 Jul 5 10:39 index.swish-e
-rw-r--r-- 1 moseley moseley 143,587,414 Jul 5 10:38 index.swish-e.prop
Just about at my 2GB file limit.
Parsing speed (files/second) is consistent.
File 5000 466.15/second over 5000 records.
File 10000 462.43/second over 5000 records.
File 15000 444.95/second over 5000 records.
..
..
File 1995000 435.81/second over 5000 records.
File 2000000 458.03/second over 5000 records.
File 2000000 448.51/second over 2000000 records.
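The progress lines above can be produced by something like the following. This is a hedged Python sketch of a per-batch rate counter, not the code that actually printed these lines. The class name RateReporter and its parameters are hypothetical; the clock argument only exists so the timing can be faked when testing.

```python
import time


class RateReporter:
    """Report records/second every `interval` records, plus an overall rate."""

    def __init__(self, interval=5000, clock=time.time):
        self.interval = interval
        self.clock = clock
        self.count = 0
        self.start = self.batch_start = clock()

    def tick(self):
        """Count one record; return a progress line at each batch boundary."""
        self.count += 1
        if self.count % self.interval:
            return None
        now = self.clock()
        elapsed = max(now - self.batch_start, 1e-9)  # avoid dividing by zero
        self.batch_start = now
        return "File %d %.2f/second over %d records." % (
            self.count, self.interval / elapsed, self.interval)

    def summary(self):
        """Return the overall rate since the reporter was created."""
        elapsed = max(self.clock() - self.start, 1e-9)
        return "File %d %.2f/second over %d records." % (
            self.count, self.count / elapsed, self.count)


if __name__ == "__main__":
    reporter = RateReporter(interval=5000)
    for _ in range(20000):
        line = reporter.tick()
        if line:
            print(line)
    print(reporter.summary())
```

The last line of the output above, with the total count as the record count, corresponds to summary() here.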
I'm using a program to create the files to index. The files are words
picked at random from a dictionary file. It can generate about 2000
files/second writing to /dev/null.
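The generator itself is prog.pl and is not shown here. As a rough idea of the approach, here is a small Python sketch that builds documents from randomly picked dictionary words. Everything in it is an assumption for illustration: the function names, the /usr/share/dict/words path, the doc%d.txt naming, and the default of 200 words per file (which roughly matches the totals above: 162,496,911 words over 800,000 files is about 203 words per file).

```python
import random


def load_words(path="/usr/share/dict/words"):
    """Read one word per line from a dictionary file (path is an assumption)."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return [line.strip() for line in fh if line.strip()]


def make_document(words, n_words, rng):
    """Join n_words randomly chosen words into one document body."""
    return " ".join(rng.choice(words) for _ in range(n_words))


def generate(words, file_count, words_per_file=200, seed=None):
    """Yield (name, body) pairs; a seed makes the output repeatable."""
    rng = random.Random(seed)
    for i in range(file_count):
        yield "doc%d.txt" % i, make_document(words, words_per_file, rng)


if __name__ == "__main__":
    # Tiny inline word list so the example runs without a dictionary file.
    sample = ["index", "search", "word", "file", "property", "sort"]
    for name, body in generate(sample, 3, words_per_file=8, seed=42):
        print(name, body)
```

Yielding the documents one at a time keeps memory use flat no matter how many files are requested.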
Without -e, parsing slows down over time:
[./prog.pl] Setting file count = 2000000
[./prog.pl] Send a 'kill -hup 2113' to abort
File 5000 573.38/second over 5000 records.
File 10000 566.01/second over 5000 records.
File 15000 552.98/second over 5000 records.
File 20000 549.03/second over 5000 records.
File 25000 529.62/second over 5000 records.
File 30000 529.52/second over 5000 records.
File 35000 508.22/second over 5000 records.
File 40000 513.50/second over 5000 records.
File 45000 490.33/second over 5000 records.
File 50000 498.09/second over 5000 records.
File 55000 474.15/second over 5000 records.
File 60000 483.36/second over 5000 records.
File 65000 458.55/second over 5000 records.
File 70000 469.76/second over 5000 records.
File 75000 444.29/second over 5000 records.
File 80000 456.70/second over 5000 records.
File 85000 429.82/second over 5000 records.
File 90000 445.07/second over 5000 records.
File 95000 418.10/second over 5000 records.
--
Bill Moseley
mailto:moseley@hank.org
Received on Fri Jul 5 18:20:25 2002