reducing index size with config-option MetaNames ?

From: swishe <swishe(at)not-real.ubka.uni-karlsruhe.de>
Date: Tue Dec 14 2004 - 09:53:26 GMT

Hello,

We are building very large indexes (> 1 million records).
The input is XML files, or streams of XML files fed in with -S prog -i stdin.

We thought we could reduce the size of the swish-e index by using
MetaNames. Normally we use UndefinedMetaTags auto.

We believed that only strings found in XML elements
declared by MetaNames would be used for indexing.
But swish-e still indexes all words in all XML elements
(see below).

Is MetaNames really only meant to limit a search to the words
contained in that META name?

Is there a way to prevent words from being used for the index by swish-e?
Or do we have to exclude these XML elements from the input files?
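
If the unwanted elements do have to be removed from the input before
indexing, a minimal sketch in Python might look like the following. It
assumes each record is a small, well-formed XML document like the samples
further down in this message. The function name keep_only is hypothetical
and not part of swish-e; how the filtered records are then handed to
swish-e (files or -S prog) is left out.

```python
import xml.etree.ElementTree as ET


def keep_only(xml_text, wanted):
    """Return the record with every direct child not named in `wanted` removed.

    Hypothetical helper: only direct children of the root element are
    examined, which matches the flat <record> layout of the sample files.
    """
    root = ET.fromstring(xml_text)
    # Iterate over a copy, because children are removed while looping.
    for child in list(root):
        if child.tag not in wanted:
            root.remove(child)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    record = "<record>\n    <id>1</id>\n    <string>record1</string>\n</record>"
    # Keep only the element that is declared with MetaNames in the conf file.
    print(keep_only(record, {"string"}))
```

With a filter like this, the contents of <id> would never reach the
indexer, so they could not end up in the index regardless of how MetaNames
is interpreted.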

Thanks a lot in advance to all the people who develop(ed) this wonderful,
easy-to-use and extremely fast tool.

Bye, Uwe

------------------------------------------------------------------
Uwe Dierolf
University of Karlsruhe - University Library
P.O.Box 6920, 76049 Karlsruhe, Germany
phone(fax) : 49/721/608-6076(4886)
www        : http://www.ubka.uni-karlsruhe.de/dierolf/
------------------------------------------------------------------


xml-records in separate files
-----------------------------
1.xml
-----
<record>
    <id>1</id>
    <string>record1</string>
</record>

2.xml
-----
<record>
    <id>2</id>
    <string>record2</string>
</record>


conf file
---------
IndexDir      .
IndexOnly     .xml
IndexContents XML2 .xml
IndexFile     ./test.index
IndexReport 1
FuzzyIndexingMode None
WordCharacters  0123456789abcdefghijklmnopqrstuvwxyz
BeginCharacters 0123456789abcdefghijklmnopqrstuvwxyz
EndCharacters   0123456789abcdefghijklmnopqrstuvwxyz
MetaNames     string
PropertyNames string


index creation: swish-e -c conf
-------------------------------
Indexing Data Source: "File-System"
Indexing "."
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 4 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
4 unique words indexed.
5 properties sorted.
2 files indexed.  126 total bytes.  4 total words.
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!


indexed words: swish-e -f test.index -T INDEX_WORDS_META
---------------------------------------------------------

-----> WORD INFO in index test.index <-----

1       1
2       1
record1 10
record2 10
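
The listing above can be grouped by its second column to see where each
word ended up. This is only a sketch: it assumes the second column is the
numeric ID of the meta name a word was indexed under, which the
-T INDEX_WORDS_META option suggests but which is not confirmed in this
message. The function name group_words_by_meta is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def group_words_by_meta(listing: str) -> Dict[int, List[str]]:
    """Group the words of a 'word <number>' listing by the number.

    Lines that are not exactly two whitespace-separated fields with a
    numeric second field (banners, blank lines) are skipped.
    """
    groups = defaultdict(list)
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        word, meta_id = parts
        groups[int(meta_id)].append(word)
    return dict(groups)


if __name__ == "__main__":
    sample = """
-----> WORD INFO in index test.index <-----

1       1
2       1
record1 10
record2 10
"""
    for meta_id, words in sorted(group_words_by_meta(sample).items()):
        print(meta_id, words)
```

Under that assumption, the contents of <id> are grouped under one number
and the contents of <string> under another, which is the behaviour this
message is asking about.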
Received on Tue Dec 14 01:53:49 2004