In message <20030903050346.GE12666@hank.org>,
Bill Moseley writes:
>On Mon, Sep 01, 2003 at 01:35:23PM -0700, John P. Rouillard wrote:
>> Hi all:
>> This occurs with both a 2.1dev25 and a 2.4pr1 release.
>>
>> Running:
>> /tools/swish_e-2.1dev25/bin/swish-e -w guest -TINDEX_WORDS_FULL -f hypermail.idx | less
>>
>> I find:
>>
>> guest
>> Meta:10 http://XXXXX/mailing-lists/ZZZZZ/0016.html Freq:4 Pos/Struct:138/9,172/9,201/9,211/9
>> Meta:10 http://XXXXX/mailing-lists/YYYYYY/0082.html Freq:1 Pos/Struct:122/9
[...]
>> Running:
>>
>> /tools/swish_e-2.1dev25/bin/swish-e -w guest -f hypermail.idx
>> # SWISH format: 2.1-dev-25
>> # Search words: guest
>> err: no results
>> .
>>
>> Huh?? Any idea what's happening here? The same thing happens if I use
>> /tools/swish_e-2.4.0_pr1/bin/swish-e
>
>Looks to me like "guest" is indexed under metaname number 10.
>
> swish-e -T index_metanames
>
>should show you the name of meta ID 10.
Hmm, 10 is swishtitle. Weird. I wonder why it's not showing up under
swishdefault, since whatever is in swishtitle should also be in
swishdefault; they should mirror each other, right?
-----> METANAMES for hypermail.idx <-----
swishdefault : id= 1 type= 1 META_INDEX Rank Bias= 0
swishreccount : id= 2 type=42 META_INTERNAL META_PROP:NUMBER
swishrank : id= 3 type=42 META_INTERNAL META_PROP:NUMBER
swishfilenum : id= 4 type=42 META_INTERNAL META_PROP:NUMBER
swishdbfile : id= 5 type=38 META_INTERNAL META_PROP:STRING(case:compare)
swishdocpath : id= 6 type= 6 META_PROP:STRING(case:compare) *presorted*
swishtitle : id= 7 type=70 META_PROP:STRING(case:ignore) *presorted*
swishdocsize : id= 8 type=10 META_PROP:NUMBER *presorted*
swishlastmodified : id= 9 type=18 META_PROP:DATE *presorted*
swishtitle : id=10 type= 1 META_INDEX Rank Bias= 0
name : id=11 type= 1 META_INDEX Rank Bias= 0
email : id=12 type= 1 META_INDEX Rank Bias= 0
name : id=13 type=70 META_PROP:STRING(case:ignore) *presorted*
email : id=14 type=70 META_PROP:STRING(case:ignore) *presorted*
sent : id=15 type=18 META_PROP:DATE *presorted*
swishdescription : id=16 type= 6 META_PROP:STRING(case:compare) *presorted*
swishdocpath : id=17 type= 1 META_INDEX Rank Bias= 0
title : id=18 type= 1 META_INDEX Rank Bias= 0 [Alias for swishtitle (10)]
path : id=19 type= 1 META_INDEX Rank Bias= 0 [Alias for swishdocpath (17)]
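For anyone comparing IDs across several indexes, here is a minimal Python sketch that groups the entries of a listing like the one above by name. It makes it easier to see that swishtitle appears twice, once as a property (id 7) and once as a searchable metaname (id 10). This is not part of swish-e: the function name and the regular expression are assumptions based only on the line format shown above.

```python
import re
from collections import defaultdict

# Matches listing lines such as:
#   swishtitle : id=10 type= 1 META_INDEX Rank Bias= 0
# (pattern inferred from the listing above; it is an assumption)
ENTRY_RE = re.compile(r"^\s*(\S+)\s*:\s*id=\s*(\d+)\s+type=\s*(\d+)\s+(.*)$")


def group_metanames(listing):
    """Map each name to its (id, kind) entries; kind is 'index' or 'property'."""
    groups = defaultdict(list)
    for line in listing.splitlines():
        match = ENTRY_RE.match(line)
        if match is None:
            continue  # header line or anything else that is not an entry
        name, meta_id, _type, rest = match.groups()
        kind = "index" if "META_INDEX" in rest else "property"
        groups[name].append((int(meta_id), kind))
    return dict(groups)


# A few lines copied from the listing above.
SAMPLE = """\
-----> METANAMES for hypermail.idx <-----
swishdefault : id= 1 type= 1 META_INDEX Rank Bias= 0
swishtitle : id= 7 type=70 META_PROP:STRING(case:ignore) *presorted*
swishtitle : id=10 type= 1 META_INDEX Rank Bias= 0
title : id=18 type= 1 META_INDEX Rank Bias= 0 [Alias for swishtitle (10)]
"""

if __name__ == "__main__":
    for name, entries in group_metanames(SAMPLE).items():
        print(name, entries)
```

Feeding it the saved output of `swish-e -T index_metanames` for each index should show at a glance which ID is the searchable swishtitle in that index.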
What is weird is that I am seeing this on two other indexes as
well. In one case it's indexed under metaname id 11, which is also
swishtitle. I am spidering for the other two indexes,
and the hypermail program is producing valid HTML, but it's not being
indexed under swishdefault.
What should I be looking at to see why swishdefault is not being
populated?
I have tried:
% /tools/swish_e-2.4.0_pr1/share/doc/swish-e/examples/prog-bin/\
index_hypermail.pl /data/www/mailing-lists/admin/0016.html > test.html
% /tools/swish_e-2.4.0_pr1/bin/swish-e -i test.html -T indexed_words
Indexing Data Source: "File-System"
Indexing "test.html"
...
Adding:[1:swishdefault(1)] 'guest' Pos:172 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'guest' Pos:206 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'guest' Pos:235 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'guest' Pos:245 Stuct:0x9 ( BODY FILE )
...
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 141 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
141 unique words indexed.
4 properties sorted.
1 file indexed. 3085 total bytes. 241 total words.
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!
Which shows that guest is indexed under swishdefault.
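To check the same thing on a longer trace without reading it by eye, a small Python sketch like the following can tally which metaname each occurrence of a word was added under. The function name and regular expression are hypothetical, written only against the "Adding:" line format shown above (including its "Stuct" spelling).

```python
import re
from collections import Counter

# Matches trace lines such as:
#   Adding:[1:swishdefault(1)] 'guest' Pos:172 Stuct:0x9 ( BODY FILE )
# (pattern inferred from the trace above; it is an assumption)
ADDING_RE = re.compile(
    r"Adding:\[\d+:([^\]()]+)\((\d+)\)\]\s+'([^']*)'\s+Pos:(\d+)"
)


def metanames_for_word(trace, word):
    """Count how many times `word` was added under each metaname."""
    counts = Counter()
    for match in ADDING_RE.finditer(trace):
        metaname, _meta_id, added_word, _pos = match.groups()
        if added_word == word:
            counts[metaname] += 1
    return counts


# The four lines from the trace above.
SAMPLE = """\
Adding:[1:swishdefault(1)] 'guest' Pos:172 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'guest' Pos:206 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'guest' Pos:235 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'guest' Pos:245 Stuct:0x9 ( BODY FILE )
"""

if __name__ == "__main__":
    print(metanames_for_word(SAMPLE, "guest"))
```

Running it over the trace from the full indexing job would show whether guest is ever added under anything other than swishdefault there.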
% /tools/swish_e-2.4.0_pr1/bin/swish-e -w guest
# SWISH format: 2.4.0-pr1
# Search words: guest
# Removed stopwords:
# Number of hits: 1
# Search time: 0.001 seconds
# Run time: 0.023 seconds
1000 test.html "TWiki security setup." 3085
So the simple test case works. But a search for guest across the
entire directory tree returns no hits. My config file is:
IndexName "Majordomo Mailing list archives"
IndexDescription "Index of Majordomo mailing list archives"
IndexPointer "http://XXXXXX/mailing-lists"
IndexAdmin "admin@example.com"
IndexDir /tools/swish_e-2.1dev25/lib/swish-e/progs/index_hypermail.pl
IndexFile /data/www/swish-e/hypermail.idx
SwishProgParameters /data/www/mailing-lists/*
ReplaceRules replace "/data/www/" "http://XXXXX/"
MetaNames swishtitle name email
PropertyNames name email
PropertyNamesDate sent
IndexContents HTML2 .html
StoreDescription HTML2 <body> 100000
UndefinedMetaTags ignore
IncludeConfigFile /home/jrouilla/develop/search/filters.cf
MetaNames swishdocpath
MetaNameAlias swishtitle title
MetaNameAlias swishdocpath path
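Since the simple test ran without this config, one difference worth listing is which metanames the config declares. Below is a minimal Python sketch that collects them from MetaNames and MetaNameAlias lines. It is only an illustration: the function name is hypothetical, it treats directive names as case-insensitive (an assumption), and it does not follow IncludeConfigFile.

```python
def declared_metanames(config_text):
    """Return (metanames, aliases) declared by MetaNames / MetaNameAlias lines.

    `aliases` maps each alias to the metaname it stands for.
    """
    metanames = []
    aliases = {}
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split()
        directive, args = parts[0].lower(), parts[1:]
        if directive == "metanames":
            metanames.extend(args)
        elif directive == "metanamealias" and len(args) >= 2:
            # First argument is the real metaname, the rest are its aliases.
            for alias in args[1:]:
                aliases[alias] = args[0]
    return metanames, aliases


# The relevant lines from the config above.
SAMPLE = """\
MetaNames swishtitle name email
PropertyNames name email
MetaNames swishdocpath
MetaNameAlias swishtitle title
MetaNameAlias swishdocpath path
"""

if __name__ == "__main__":
    names, aliases = declared_metanames(SAMPLE)
    print(names)
    print(aliases)
```

For this config it reports swishtitle, name, email and swishdocpath as declared metanames, with title and path as aliases, which matches the META_INDEX entries in the listing earlier in this message.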
The config file filters.cf is:
# include all the available filters and mappings for
# files that we index
FileFilter .pdf /tools/swish_e-2.1dev25/lib/swish-e/filters/_pdf2html.pl
FileFilter .PDF /tools/swish_e-2.1dev25/lib/swish-e/filters/_pdf2html.pl
FileFilter .ppt /tools/xlhtml-0.5.1/bin/ppthtml "'%p'"
FileFilter .doc /tools/catdoc-0.91.5/bin/catdoc "-a -s8859-1 -d8859-1 '%p'"
FileFilter .xls /tools/xlhtml-0.5.1/bin/xlhtml "-nc '%p'"
FileFilter .exe /usr/bin/strings "'%p'"
FileFilter .rpm /bin/rpm "-qil -p '%p'"
FileFilter .zip /usr/bin/unzip "-v \"%p\""
FileFilter .ZIP /usr/bin/unzip "-v \"%p\""
FileFilter .tar.Z /bin/tar "-tzvf '%p'"
FileFilter .tar.gz /bin/tar "-tzvf '%p'"
FileFilter .tgz /bin/tar "-tZvf '%p'"
FileFilter .tar /bin/tar "-tvf '%p'"
FileFilter .gz /bin/gunzip "-c '%p'"
FileFilter .z /bin/gunzip "-c '%p'"
FileFilter .Z /bin/gunzip "-c '%p'"
FileFilter .ps /usr/bin/ps2ascii "'%p'"
FileFilter .rtf /usr/bin/strings "'%p'"
IndexContents HTML2 .pdf .ppt .PDF .xls
IndexContents TXT2 .doc .xls .exe .zip .ZIP .tar.Z .tar.gz .tgz .tar
IndexContents TXT2 .gz .z .Z .ps .rtf
-- rouilj
John Rouillard
===========================================================================
My employers don't acknowledge my existence much less my opinions.
Received on Wed Sep 3 15:12:13 2003