generation of large index failed with swish-e 2.4.3

From: swishe <swishe(at)not-real.ubka.uni-karlsruhe.de>
Date: Mon Jan 31 2005 - 08:41:10 GMT
Hello swish-e community,

We have problems with swish-e 2.4.3 (compiled with large file support:
./configure CPPFLAGS='-D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64').

We have about 3 million XML records in one file, each preceded by header information like this:
   Content-Length: 1407
   Path-Name: 3
   Document-Type: XML*

   <?xml version="1.0" encoding="utf-8"?>
   ...
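
For illustration, a record in the layout shown above could be produced roughly as follows. This is only a sketch: the function name format_record is hypothetical, and it assumes that Content-Length counts the bytes of the XML content that follows the blank line.

```python
def format_record(path_name, xml_text, doc_type="XML*"):
    """Return one record (headers, blank line, content) as bytes.

    Assumption: Content-Length is the byte length of the content
    after the blank line, so the text is encoded before measuring.
    """
    body = xml_text.encode("utf-8")
    header = (
        f"Content-Length: {len(body)}\n"
        f"Path-Name: {path_name}\n"
        f"Document-Type: {doc_type}\n"
        "\n"  # blank line separates headers from content
    ).encode("ascii")
    return header + body


if __name__ == "__main__":
    import sys

    xml = '<?xml version="1.0" encoding="utf-8"?>\n<record>example</record>\n'
    sys.stdout.buffer.write(format_record("3", xml))
```

Measuring the encoded bytes rather than the character count matters once a record contains non-ASCII characters, since a wrong length would shift every following record.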
   
The file size is ~ 5 GB (zipped ~ 700 MB).
Our Linux server has 6 GB of main memory.

We tried to build the index both with and without the -e option:
zcat <zipped-xml-file> | swish-e [-e] -v 3 -c <conf-file> -S prog -i stdin
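
The same pipeline can also be driven from a small script that records the exit status of swish-e, which may help when the process stops without a message. This is a sketch only: build_command and run_indexer are hypothetical helper names, the paths are placeholders, and the swish-e options are exactly those from the command line above.

```python
import gzip
import shutil
import subprocess


def build_command(conf_file, economy=True):
    """Assemble the swish-e command line used above."""
    cmd = ["swish-e"]
    if economy:
        cmd.append("-e")
    cmd += ["-v", "3", "-c", conf_file, "-S", "prog", "-i", "stdin"]
    return cmd


def run_indexer(gz_path, conf_file, log_path, economy=True):
    """Stream the gzipped records into swish-e and return its exit status."""
    with gzip.open(gz_path, "rb") as src, open(log_path, "wb") as log:
        proc = subprocess.Popen(
            build_command(conf_file, economy),
            stdin=subprocess.PIPE,
            stdout=log,
            stderr=subprocess.STDOUT,
        )
        try:
            shutil.copyfileobj(src, proc.stdin)
            proc.stdin.close()
        except BrokenPipeError:
            pass  # swish-e exited before reading all input
        return proc.wait()
```

On POSIX systems a negative return value from run_indexer means the process was terminated by a signal (for example -9 for SIGKILL), which would not necessarily leave an error line in the logfile.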
   
In both cases we got incomplete index and prop files with a ".temp" extension.

Without "-e", swish-e does not process all XML records.
It ran out of memory while working on the XML record with id 4050780 (the id is the path name).
End of the logfile:
   4050780 - Using XML2 parser -  (60 words)
   err: Ran out of memory (could not allocate 262144 more bytes)!
   .

With the "-e" option, swish-e processed all XML records (id 1111111135
is the last record in our XML file), but it then stopped working without
any error message in the logfile and without generating a core dump.
End of the logfile:
   1111111135 - Using XML2 parser -  (20 words)

   Removing very common words...
   no words removed.
   Writing main index...
   Sorting words ...
   Sorting 14,395,157 words alphabetically
   Writing header ...
   Writing index entries ...
     Writing word text: ...
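
To compare the two runs, the logfiles can be summarized with a short script. The sketch below assumes only the line shapes visible in the excerpts above (a record line such as "4050780 - Using XML2 parser -  (60 words)" and error lines starting with "err:"); the function name summarize_log is hypothetical.

```python
import re

# Matches record lines like "   4050780 - Using XML2 parser -  (60 words)"
RECORD_LINE = re.compile(r"^\s*(\S+) - Using (\S+) parser -\s+\((\d+) words\)")


def summarize_log(lines):
    """Count processed records, remember the last path name, collect errors."""
    records = 0
    last_path = None
    errors = []
    for line in lines:
        match = RECORD_LINE.match(line)
        if match:
            records += 1
            last_path = match.group(1)
        elif line.lstrip().startswith("err:"):
            errors.append(line.strip())
    return {"records": records, "last_path": last_path, "errors": errors}


if __name__ == "__main__":
    import sys

    # Usage: python summarize_log.py < logfile
    print(summarize_log(sys.stdin))
```

The record count shows how far each run got, and last_path identifies the record that was being processed when the run ended.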
   
Perhaps someone can help us.
Thanks a lot in advance.

Best regards, Uwe Dierolf

--------------------------------------------------------------------------
Uwe Dierolf                       Tel  0721/608-6076
University Library of Karlsruhe   Fax  0721/608-4886
Straße am Forum                   76049 Karlsruhe / Germany
--------------------------------------------------------------------------
Received on Mon Jan 31 00:41:23 2005