I wrote a little test program that randomly selects words from a dictionary
and builds a "file" to index using -S prog.
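A generator of this kind can be sketched as follows. This is not the prog.pl used for the runs here; it is a Python illustration that assumes the -S prog convention of a Path-Name and a Content-Length header, then a blank line, then the document body. The dictionary path, the file names, and the word counts are made up for illustration.

```python
import random
import sys


def make_record(path, vocabulary, rng, n_words):
    """Build one record: headers, a blank line, then the random body."""
    body = " ".join(rng.choice(vocabulary) for _ in range(n_words)) + "\n"
    data = body.encode("utf-8")
    # Content-Length must be the byte length of the body that follows.
    header = f"Path-Name: {path}\nContent-Length: {len(data)}\n\n"
    return header.encode("ascii") + data


def main(dict_path, count=1000):
    """Write `count` random documents to stdout for the indexer to read."""
    with open(dict_path, encoding="utf-8", errors="ignore") as fh:
        vocabulary = [w.strip() for w in fh if w.strip()]
    rng = random.Random()
    out = sys.stdout.buffer
    for i in range(count):
        # Hypothetical naming scheme and size range, chosen for illustration.
        out.write(make_record(f"file{i}.txt", vocabulary, rng, rng.randint(100, 500)))


# Only runs when a dictionary path is given, e.g. /usr/share/dict/words.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Seeding the generator (for example random.Random(0)) would make the two runs index identical input, which is worth doing if the goal is a strict comparison.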
It's clear how performance drops off with the number of files indexed. It
may be that our hashes can be tuned better. But it also looks like using
-e is a very good thing to do. Well, to a point.
In this first test I let it run to about 100,000 files with an average size of
1,964 bytes. This used about 1/4GB of RAM.
PID TTY STAT TIME MAJFL TRS DRS RSS %MEM COMMAND
2028 pts/0 R 5:44 224 243 248828 244616 47.6 ./swish-e -S prog -c
moseley@bumby:~/swish-e/src$ ./swish-e -S prog -c c
Indexing Data Source: "External-Program"
Indexing "./prog.pl"
[./prog.pl] Setting file count = 1000000
[./prog.pl] Send a 'kill -hup 2029' to abort
File 5000 534.16/second over 5000 records.
File 10000 483.20/second over 5000 records.
File 15000 441.84/second over 5000 records.
File 20000 407.12/second over 5000 records.
File 25000 377.15/second over 5000 records.
File 30000 351.27/second over 5000 records.
File 35000 328.86/second over 5000 records.
File 40000 308.51/second over 5000 records.
File 45000 289.61/second over 5000 records.
File 50000 273.41/second over 5000 records.
File 55000 257.65/second over 5000 records.
File 60000 243.72/second over 5000 records.
File 65000 231.59/second over 5000 records.
File 70000 220.66/second over 5000 records.
File 75000 210.78/second over 5000 records.
File 80000 201.33/second over 5000 records.
File 85000 192.95/second over 5000 records.
File 90000 185.03/second over 5000 records.
File 95000 177.96/second over 5000 records.
File 100000 170.70/second over 5000 records.
[./prog.pl] Aborted at record 104714
File 104714 255.56/second over 104714 records.
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 45373 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
45373 unique words indexed.
4 properties sorted.
104713 files indexed. 205733891 total bytes. 21268870 total words.
Elapsed time: 00:07:02 CPU time: 00:06:06
Indexing done!
Now, here's the same test using -e.
Memory usage is much better: about 10MB instead of 1/4GB. Overall file
processing speed is better too: 310 files per second with -e vs. 255/second
without -e.
PID TTY STAT TIME MAJFL TRS DRS RSS %MEM COMMAND
2063 pts/0 R 8:31 232 243 12432 10348 2.0 ./swish-e -S prog -c c
moseley@bumby:~/swish-e/src$ ./swish-e -S prog -c c -e
Indexing Data Source: "External-Program"
Indexing "./prog.pl"
[./prog.pl] Setting file count = 1000000
[./prog.pl] Send a 'kill -hup 2064' to abort
File 5000 313.82/second over 5000 records.
File 10000 311.03/second over 5000 records.
File 15000 310.38/second over 5000 records.
File 20000 309.69/second over 5000 records.
File 25000 309.97/second over 5000 records.
File 30000 310.03/second over 5000 records.
File 35000 310.05/second over 5000 records.
File 40000 310.00/second over 5000 records.
File 45000 309.54/second over 5000 records.
File 50000 311.07/second over 5000 records.
File 55000 310.09/second over 5000 records.
File 60000 310.48/second over 5000 records.
File 65000 310.81/second over 5000 records.
File 70000 310.61/second over 5000 records.
File 75000 310.90/second over 5000 records.
File 80000 310.14/second over 5000 records.
File 85000 310.55/second over 5000 records.
File 90000 311.14/second over 5000 records.
File 95000 309.71/second over 5000 records.
File 100000 309.83/second over 5000 records.
File 105000 310.51/second over 5000 records.
File 110000 310.95/second over 5000 records.
File 115000 310.43/second over 5000 records.
File 120000 310.54/second over 5000 records.
File 125000 309.89/second over 5000 records.
[./prog.pl] Aborted at record 128714
File 128714 310.39/second over 128714 records.
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 45373 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
45373 unique words indexed.
4 properties sorted.
128713 files indexed. 252888063 total bytes. 26143735 total words.
Elapsed time: 00:09:46 CPU time: 00:08:34
Indexing done
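To put a number on the drop-off, the progress lines from the two runs can be compared directly. The sketch below is only an illustration: it assumes the "File N rate/second over M records." format shown in the logs, and the function names are made up. It keeps the 5000-record window lines and skips the final cumulative summary line.

```python
import re

RATE_LINE = re.compile(r"^File\s+(\d+)\s+([\d.]+)/second over (\d+) records\.")


def window_rates(log_text, window=5000):
    """Return (file_number, rate) for each per-window progress line."""
    rates = []
    for line in log_text.splitlines():
        m = RATE_LINE.match(line.strip())
        # Skip the cumulative summary line, whose window is the full run.
        if m and int(m.group(3)) == window:
            rates.append((int(m.group(1)), float(m.group(2))))
    return rates


def slowdown(rates):
    """Ratio of the first reported rate to the last reported rate."""
    return rates[0][1] / rates[-1][1]
```

For the run without -e, the first and last window rates (534.16 and 170.70) give a slowdown of roughly 3.1. For the -e run (313.82 and 309.83) the ratio stays close to 1.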
I tried indexing 1,000,000 files but ran into my 2GB file size limit:
File 880000 309.66/second over 5000 records.
File size limit exceeded
That took about an hour and a quarter to get to that point.
--
Bill Moseley
mailto:moseley@hank.org
Received on Thu Jun 13 20:53:31 2002