Indexing differs for 2 lines swapped in file

From: Dominique Phommahaxay <dominique.phommahaxay(at)not-real.writeme.com>
Date: Sat Oct 25 2003 - 06:24:43 GMT
All,

I hope this is not a duplicate of an issue that I missed when searching the forum for indexing problems.

Here is what is happening:

0. Configuration
================

- Win XP Home
- swish-e:
  . version: 2.2.3

C:\private\work\jway>swish-e -V
SWISH-E 2.2.3

  . configuration:

C:\private\work\jway>type conf.txt
IndexDir C:\private\work\jway\BTTITLE01312003
IndexOnly .CSV

1. Test 1
=========

1.1. File 1
===========
. name: c:\private\work\jway\BTTITLE01312003\BTTITLE01312003.CSV
. size: 180,342,112 bytes
. content: 565328 records of books formatted as follows:

ISBN|Binding|Title|Edition|Merchandise Category|LanguageCD|Author|...

. contains the word J2Ee at record 272191:

0672317958|PAP|Building Java Enterprise System With J2Ee|BOOK & CD|COM|ENG|Perrone, Paul J./ Chaganti, Venkata S.R.R.|...
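The record number quoted above can be double-checked with a small script. This is only a sketch: the function name `find_records` is made up for illustration, and it assumes one record per line and a case-insensitive substring match (not swish-e's own word splitting).

```python
from typing import Iterable, List


def find_records(lines: Iterable[str], word: str) -> List[int]:
    """Return the 1-based numbers of the records whose text contains `word`.

    Matching is a plain case-insensitive substring test; this is an
    assumption for illustration, not how swish-e tokenizes words.
    """
    needle = word.lower()
    return [n for n, line in enumerate(lines, start=1) if needle in line.lower()]


if __name__ == "__main__":
    # Tiny in-memory sample in the same pipe-delimited layout as the CSV.
    sample = [
        "ISBN|Binding|Title|Edition|Merchandise Category|LanguageCD|Author|...",
        "0672317958|PAP|Building Java Enterprise System With J2Ee|BOOK & CD|COM|ENG|...",
    ]
    print(find_records(sample, "J2Ee"))
```

To run it against the real file, pass an open file object instead of the list; the text encoding to use is not stated in this report and would have to be chosen by the reader.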

1.2. Indexing File 1
====================

C:\private\work\jway>swish-e -c conf.txt
Indexing Data Source: "File-System"
Indexing "C:\private\work\jway\BTTITLE01312003"
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 172494 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
172494 unique words indexed.
4 properties sorted.
1 file indexed.  180342112 total bytes.  2272318 total words.
Elapsed time: 00:00:48 CPU time: 00:00:48
Indexing done!

C:\private\work\jway>dir
 Volume in drive C has no label.
 Volume Serial Number is BC31-40DA

 Directory of C:\private\work\jway

10/25/2003  01:33 AM    <DIR>          .
10/25/2003  01:33 AM    <DIR>          ..
10/25/2003  01:07 AM    <DIR>          BTTITLE01312003
02/03/2003  04:41 PM        55,721,416 BTTITLE01312003.ZIP
10/22/2003  10:51 AM                61 conf.txt
10/25/2003  01:16 AM         8,979,569 index.swish-e
10/25/2003  01:16 AM                71 index.swish-e.prop
               4 File(s)     64,701,117 bytes
               3 Dir(s)  43,533,873,152 bytes free

C:\private\work\jway>

1.3. Search for J2Ee
====================

C:\private\work\jway>swish-e -w J*
# SWISH format: 2.2.3
# Search words: J*
# Number of hits: 1
# Search time: 0.060 seconds
# Run time: 0.080 seconds
937 C:/private/work/jway/BTTITLE01312003/BTTitle01312003.CSV "BTTitle01312003.CSV" 180342112
.

C:\private\work\jway>swish-e -w J2*
# SWISH format: 2.2.3
# Search words: J2*
err: no results
.

C:\private\work\jway>swish-e -w J2E*
# SWISH format: 2.2.3
# Search words: J2E*
err: no results
.

C:\private\work\jway>swish-e -w J2Ee*
# SWISH format: 2.2.3
# Search words: J2Ee*
err: no results
.

C:\private\work\jway>swish-e -w J2Ee
# SWISH format: 2.2.3
# Search words: J2Ee
err: no results
.

C:\private\work\jway>swish-e -w Building Java Enterprise System With
# SWISH format: 2.2.3
# Search words: Building Java Enterprise System With
# Number of hits: 1
# Search time: 0.000 seconds
# Run time: 0.030 seconds
1000 C:/private/work/jway/BTTITLE01312003/BTTitle01312003.CSV "BTTitle01312003.CSV" 180342112
.

C:\private\work\jway>swish-e -w Building Java Enterprise System With J2Ee
# SWISH format: 2.2.3
# Search words: Building Java Enterprise System With J2Ee
err: no results
.

C:\private\work\jway>

1.4. Conclusion
===============

After indexing File 1, the search cannot find J2Ee.

2. Test 2
=========

2.1. File 1
===========
Renamed File 1 to c:\private\work\jway\BTTITLE01312003\BTTITLE01312003.CSV.original.txt to prevent it from being indexed.

2.2. File 2
===========
After a manual binary search, File 2 consists of the first 15649 records of File 1, with the record containing J2Ee appended at the end.

. name: c:\private\work\jway\BTTITLE01312003\BTTITLE01312003_2.CSV
. size: 5,252,836 bytes
. content: 15650 records of books formatted as follows:

ISBN|Binding|Title|Edition|Merchandise Category|LanguageCD|Author|...

. contains the word J2Ee at record 15650:

0672317958|PAP|Building Java Enterprise System With J2Ee|BOOK & CD|COM|ENG|Perrone, Paul J./ Chaganti, Venkata S.R.R.|...
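For anyone who wants to rebuild a file like File 2, the construction can be sketched as below. The helper name `build_reduced` is hypothetical, and the sketch assumes one record per line; it works on raw bytes so that the original line terminators are kept exactly as they are.

```python
def build_reduced(data: bytes, keep: int, target: int) -> bytes:
    """Return the first `keep` records followed by record number `target`.

    `target` is a 1-based record number in the original data. Line
    terminators are preserved byte for byte (keepends=True).
    """
    records = data.splitlines(keepends=True)
    return b"".join(records[:keep] + [records[target - 1]])


if __name__ == "__main__":
    # Small stand-in for File 1: four records with Windows line endings.
    original = b"r1\r\nr2\r\nr3\r\nr4\r\n"
    # Keep the first 2 records and append record 4.
    print(build_reduced(original, keep=2, target=4))
```

With the numbers from this report the call would be `build_reduced(data, keep=15649, target=272191)`. Reading the whole 180 MB file into memory is a simplification made for the sketch.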

2.3. Deleting current index
===========================

C:\private\work\jway>del index.*

C:\private\work\jway>dir
 Volume in drive C has no label.
 Volume Serial Number is BC31-40DA

 Directory of C:\private\work\jway

10/25/2003  01:49 AM    <DIR>          .
10/25/2003  01:49 AM    <DIR>          ..
10/25/2003  01:45 AM    <DIR>          BTTITLE01312003
02/03/2003  04:41 PM        55,721,416 BTTITLE01312003.ZIP
10/22/2003  10:51 AM                61 conf.txt
               2 File(s)     55,721,477 bytes
               3 Dir(s)  43,547,041,792 bytes free

C:\private\work\jway>dir

2.4. Indexing File 2
====================

C:\private\work\jway>swish-e -c conf.txt
Indexing Data Source: "File-System"
Indexing "C:\private\work\jway\BTTITLE01312003"
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 84120 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
84120 unique words indexed.
4 properties sorted.
1 file indexed.  5252836 total bytes.  962359 total words.
Elapsed time: 00:00:04 CPU time: 00:00:04
Indexing done!

C:\private\work\jway>

2.5. Search for J2Ee
====================

C:\private\work\jway>swish-e -w J2*
# SWISH format: 2.2.3
# Search words: J2*
err: no results
.

C:\private\work\jway>swish-e -w J2E*
# SWISH format: 2.2.3
# Search words: J2E*
err: no results
.

C:\private\work\jway>swish-e -w J2Ee*
# SWISH format: 2.2.3
# Search words: J2Ee*
err: no results
.

C:\private\work\jway>swish-e -w J2Ee
# SWISH format: 2.2.3
# Search words: J2Ee
err: no results
.

C:\private\work\jway>swish-e -w Building Java Enterprise System With J2Ee
# SWISH format: 2.2.3
# Search words: Building Java Enterprise System With J2Ee
err: no results
.

C:\private\work\jway>swish-e -w Building Java Enterprise System With
# SWISH format: 2.2.3
# Search words: Building Java Enterprise System With
# Number of hits: 1
# Search time: 0.000 seconds
# Run time: 0.040 seconds
1000 C:/private/work/jway/BTTITLE01312003/BTTitle01312003-1.csv "BTTitle01312003-1.csv" 5252836
.

C:\private\work\jway>

2.6. Conclusion
===============

After indexing File 2, the search cannot find J2Ee.

3. Test 3
=========

3.1. File 2
===========

In File 2, swap the last record, which contains J2Ee (at position 15650), with the record at position 15649, so that J2Ee becomes the second-to-last record.

. name: c:\private\work\jway\BTTITLE01312003\BTTITLE01312003_2.CSV
. size: 5,252,836 bytes
. content: 15650 records of books formatted as follows:

ISBN|Binding|Title|Edition|Merchandise Category|LanguageCD|Author|...

. contains the word J2Ee at record 15649:

0672317958|PAP|Building Java Enterprise System With J2Ee|BOOK & CD|COM|ENG|Perrone, Paul J./ Chaganti, Venkata S.R.R.|...
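The swap itself can be reproduced with a short script. This is a sketch under assumptions: the function names are invented for illustration, records are taken to be lines, and the line terminators stay in their original positions so that the file size does not change (5,252,836 bytes both before and after the swap in this report).

```python
from typing import Tuple


def _split_eol(record: bytes) -> Tuple[bytes, bytes]:
    """Split a record into (body, line terminator); the terminator may be empty."""
    body = record.rstrip(b"\r\n")
    return body, record[len(body):]


def swap_last_two(data: bytes) -> bytes:
    """Swap the bodies of the last two records, leaving terminators in place.

    Keeping the terminators where they were means a final record without a
    trailing newline does not get merged into its neighbour after the swap.
    """
    records = data.splitlines(keepends=True)
    if len(records) < 2:
        return data
    a_body, a_eol = _split_eol(records[-2])
    b_body, b_eol = _split_eol(records[-1])
    records[-2] = b_body + a_eol
    records[-1] = a_body + b_eol
    return b"".join(records)


if __name__ == "__main__":
    # Last record deliberately has no trailing newline.
    print(swap_last_two(b"r1\r\nr2\r\nr3"))
```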

3.2. Deleting current index
===========================

C:\private\work\jway>del index.*

C:\private\work\jway>dir
 Volume in drive C has no label.
 Volume Serial Number is BC31-40DA

 Directory of C:\private\work\jway

10/25/2003  02:00 AM    <DIR>          .
10/25/2003  02:00 AM    <DIR>          ..
10/25/2003  01:45 AM    <DIR>          BTTITLE01312003
02/03/2003  04:41 PM        55,721,416 BTTITLE01312003.ZIP
10/22/2003  10:51 AM                61 conf.txt
               2 File(s)     55,721,477 bytes
               3 Dir(s)  43,548,086,272 bytes free

C:\private\work\jway>

3.3. Indexing File 2
====================

C:\private\work\jway>swish-e -c conf.txt
Indexing Data Source: "File-System"
Indexing "C:\private\work\jway\BTTITLE01312003"
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 84128 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
84128 unique words indexed.
4 properties sorted.
1 file indexed.  5252836 total bytes.  962444 total words.
Elapsed time: 00:00:04 CPU time: 00:00:04
Indexing done!

C:\private\work\jway>


Note the differences in output compared with "2.4. Indexing File 2":

. Sorting x words alphabetically
. x unique words indexed.
. y total words.
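To put numbers on that difference, the counts from the two indexing runs can be pulled out of the logs and subtracted. The sketch below is only an illustration: `parse_counts` is a made-up helper, and the two strings are copied from the swish-e output in 2.4 and 3.3.

```python
import re
from typing import Dict


def parse_counts(log: str) -> Dict[str, int]:
    """Extract the unique-word and total-word counts from swish-e indexing output."""
    unique = int(re.search(r"(\d+) unique words indexed", log).group(1))
    total = int(re.search(r"(\d+) total words", log).group(1))
    return {"unique": unique, "total": total}


# Lines copied from "2.4. Indexing File 2" and "3.3. Indexing File 2".
test2_log = (
    "84120 unique words indexed.\n"
    "1 file indexed.  5252836 total bytes.  962359 total words."
)
test3_log = (
    "84128 unique words indexed.\n"
    "1 file indexed.  5252836 total bytes.  962444 total words."
)

before = parse_counts(test2_log)
after = parse_counts(test3_log)
print("unique words difference:", after["unique"] - before["unique"])
print("total words difference:", after["total"] - before["total"])
```

The byte count is identical in both runs, yet the run after the swap reports 8 more unique words and 85 more total words.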

3.4. Search for J2Ee
====================

C:\private\work\jway>swish-e -w J2Ee
# SWISH format: 2.2.3
# Search words: J2Ee
# Number of hits: 1
# Search time: 0.000 seconds
# Run time: 0.020 seconds
1000 C:/private/work/jway/BTTITLE01312003/BTTitle01312003-1.csv "BTTitle01312003-1.csv" 5252836
.

C:\private\work\jway>swish-e -w J2*
# SWISH format: 2.2.3
# Search words: J2*
# Number of hits: 1
# Search time: 0.000 seconds
# Run time: 0.020 seconds
1000 C:/private/work/jway/BTTITLE01312003/BTTitle01312003-1.csv "BTTitle01312003-1.csv" 5252836
.

C:\private\work\jway>swish-e -w J2E*
# SWISH format: 2.2.3
# Search words: J2E*
# Number of hits: 1
# Search time: 0.000 seconds
# Run time: 0.020 seconds
1000 C:/private/work/jway/BTTITLE01312003/BTTitle01312003-1.csv "BTTitle01312003-1.csv" 5252836
.

C:\private\work\jway>swish-e -w J2Ee*
# SWISH format: 2.2.3
# Search words: J2Ee*
# Number of hits: 1
# Search time: 0.000 seconds
# Run time: 0.020 seconds
1000 C:/private/work/jway/BTTITLE01312003/BTTitle01312003-1.csv "BTTitle01312003-1.csv" 5252836
.

C:\private\work\jway>swish-e -w J2Ee
# SWISH format: 2.2.3
# Search words: J2Ee
# Number of hits: 1
# Search time: 0.000 seconds
# Run time: 0.020 seconds
1000 C:/private/work/jway/BTTITLE01312003/BTTitle01312003-1.csv "BTTitle01312003-1.csv" 5252836
.

C:\private\work\jway>swish-e -w Building Java Enterprise System With J2Ee
# SWISH format: 2.2.3
# Search words: Building Java Enterprise System With J2Ee
# Number of hits: 1
# Search time: 0.000 seconds
# Run time: 0.020 seconds
1000 C:/private/work/jway/BTTITLE01312003/BTTitle01312003-1.csv "BTTitle01312003-1.csv" 5252836
.

C:\private\work\jway>swish-e -w Building Java Enterprise System With
# SWISH format: 2.2.3
# Search words: Building Java Enterprise System With
# Number of hits: 1
# Search time: 0.000 seconds
# Run time: 0.020 seconds
1000 C:/private/work/jway/BTTITLE01312003/BTTitle01312003-1.csv "BTTitle01312003-1.csv" 5252836
.

C:\private\work\jway>

3.5. Conclusion
===============

After indexing File 2, the search found J2Ee.

3.6. Note
=========

File 2 ended up at this size after multiple trials to find the limit at which J2Ee can still be found; the number of records simply happened to land on that specific value. The record containing J2Ee can be placed anywhere from record 1 to the second-to-last record, and the search for J2Ee then returns the proper result.

4. Conclusion
=============

Based on Test 2 and Test 3:

. swapping 2 records in a file leads to different indexing output and different search results, which is incorrect.
. the search returns proper results as long as the indexing is processed properly.

5. What now?
============
How can I help solve this issue (providing assistance, uploading files for testing -- they are huge files...)? Please advise.


Regards,

Dominique Phommahaxay (dominique dot phommahaxay at writeme dot com)
Received on Sat Oct 25 06:37:14 2003