Hi list,
I am seeing strange behavior when indexing a large amount of data
(~22 GB including images, PDF files and MS Word files, although only the
.htm, .html and .txt files are indexed, ~157,000 files).
The problem is that I am getting two different numbers of indexed words
from the same data. For example, here are some output lines after executing:
swish-e -c swish.config
(you can see the config file at the end of this email):
==========Output1 begin==========
Parsing config file 'swish.conf'
Indexing Data Source: "File-System"
Indexing "../disk2/Info"
..
In dir "../disk2/Info/ebsp/apac/cn":
benefits.htm - Using HTML2 parser - (43 words)
..
617,504 unique words indexed.
Sorting property: swishdocpath
Sorting property: swishtitle
Sorting property: swishdocsize
Sorting property: swishlastmodified
Sorting property: swishdescription
5 properties sorted.
157,686 files indexed. 2,965,919,331 total bytes. 121,765,998 total words.
Elapsed time: 01:21:49 CPU time: 00:13:25
Indexing done!
==========Output1 end==========
Another output from the same command:
==========Output2 begin==========
Parsing config file 'swish.conf'
Indexing Data Source: "File-System"
Indexing "../disk2/Info"
..
In dir "../disk2/Info/ebsp/apac/cn":
benefits.htm - Using HTML2 parser - (40 words)
..
617,584 unique words indexed.
Sorting property: swishdocpath
Sorting property: swishtitle
Sorting property: swishdocsize
Sorting property: swishlastmodified
Sorting property: swishdescription
5 properties sorted.
157,686 files indexed. 2,965,919,331 total bytes. 121,765,813 total words.
Elapsed time: 01:26:31 CPU time: 00:12:59
Indexing done!
==========Output2 end==========
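As you can see, the numbers of unique indexed words and total words
differ between the two runs, and even the word count reported for the
same file (benefits.htm) changes from 43 to 40.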
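One way to narrow this down would be to compare the per-file word counts of two runs and list only the files whose counts differ. The following is a minimal sketch in Python, not a tested tool. It assumes the IndexReport 3 output of each run was redirected to a file (run1.log and run2.log are hypothetical names) and that the lines look exactly like the "In dir" and "Using ... parser" lines quoted above.

```python
import re
import sys

# Patterns modelled on the report lines quoted in this email (an assumption;
# other swish-e versions or report levels may format these lines differently).
DIR_RE = re.compile(r'^In dir "(.*)":\s*$')
FILE_RE = re.compile(r'^\s*(.+?) - Using \S+ parser - \((\d+) words\)\s*$')


def parse_report(lines):
    """Map 'directory/file' to the word count reported for that file."""
    counts = {}
    current_dir = ""
    for line in lines:
        m = DIR_RE.match(line)
        if m:
            current_dir = m.group(1)
            continue
        m = FILE_RE.match(line)
        if m:
            counts[current_dir + "/" + m.group(1)] = int(m.group(2))
    return counts


def diff_counts(a, b):
    """Return {path: (count_in_a, count_in_b)} for paths whose counts differ."""
    paths = sorted(set(a) | set(b))
    return {p: (a.get(p), b.get(p)) for p in paths if a.get(p) != b.get(p)}


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (hypothetical file names): python compare_runs.py run1.log run2.log
    reports = []
    for path in sys.argv[1:3]:
        with open(path, encoding="utf-8", errors="replace") as f:
            reports.append(parse_report(f))
    for path, (n1, n2) in diff_counts(*reports).items():
        print(f"{path}: {n1} vs {n2}")
```

The files this prints would be the ones to re-index in isolation to see whether the count also flips there.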
After the indexing process is finished I extract the keywords with the
command:
swish-e -k* > swish_keyword.out
and I noticed a pattern in the sizes of the keyword files.
For example:
macr@linux:~/SearchEngine/Golden> ls -l
-rw-r--r-- 1 macr users 5172418 2006-04-10 08:51 swish_keyword.out1
-rw-r--r-- 1 macr users 5173104 2006-04-10 09:06 swish_keyword.out2
-rw-r--r-- 1 macr users 5172418 2006-04-10 09:17 swish_keyword.out3
-rw-r--r-- 1 macr users 5173104 2006-04-10 10:08 swish_keyword.out4
Notice that output file 1 has the same size as output file 3, and output
file 2 has the same size as output file 4. This pattern stays consistent if
I continue indexing and extracting the keywords.
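Equal sizes do not prove equal contents, so the keyword files could also be compared directly to see which words appear in one run but not the other. This is a minimal sketch in Python. It assumes each keyword file can be treated as a list of whitespace-separated tokens, which I have not checked against the exact layout of the -k output.

```python
import sys


def token_diff(data_a, data_b):
    """Return (tokens only in a, tokens only in b), each sorted.

    The inputs are raw bytes, so no character encoding is assumed.
    """
    a = set(data_a.split())
    b = set(data_b.split())
    return sorted(a - b), sorted(b - a)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python keyword_diff.py swish_keyword.out1 swish_keyword.out2
    with open(sys.argv[1], "rb") as f:
        data_a = f.read()
    with open(sys.argv[2], "rb") as f:
        data_b = f.read()
    print("byte-identical" if data_a == data_b else "contents differ")
    only_a, only_b = token_diff(data_a, data_b)
    for word in only_a:
        print("only in first: ", word.decode("latin-1"))
    for word in only_b:
        print("only in second:", word.decode("latin-1"))
```

Running it on files 1 and 3 would confirm whether they really are identical, and running it on files 1 and 2 would show the words that come and go between runs.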
I've only seen this behavior when indexing all the information;
if I index just a few directories, I always get the same number of
indexed words.
Here is my system description:
OS: SuSE Linux Enterprise Server 9 Service Pack 3 (kernel-2.6.5-smp)
CPU: Intel (R) Pentium 4 (3.00 GHz with HT)
RAM: 1 GB
SWISH-E 2.4.3
libxml2-2.6.7
And my swish.config file is:
========= swish.config begin ==========
IndexReport 3
IndexDir ../disk2/Info
IndexOnly .htm .html .txt
IndexContents TXT2 .txt
DefaultContents HTML2
StoreDescription HTML2 <body> 80
#Filesystem in ../disk2 is ext3
ReplaceRules replace "../disk2/Info"
========= swish.config end ==========
Any idea why this is happening?
Best Regards,
Rodolfo.
Received on Thu Apr 20 16:23:58 2006