Hi all,
I haven't given up on indexing Excel files, and I have now tried indexing with
-S fs and swish_filter.pl.
It seems that only PDF files are filtered by swish_filter.pl: in the debugging
output below, only acrobat.pdf gets a "- Filtered:" line, while excel.xls and
word.doc yield just a few garbage words.
spider.pl seems to have a similar problem.
What am I doing wrong? What am I missing? Does anybody have a solution?
Thanks for your help,
Leo
swish@bza141:~> swish-e -S prog -c /home/swish/swish-e/conf/swish.conf.test
-T indexed_words
Indexing Data Source: "External-Program"
Indexing "spider.pl"
/home/swish/swish-e/lib/swish-e/spider.pl: Reading parameters from
'/home/swish/swish-e/conf/SpiderConfigTest.pl'
-- Starting to spider: http://bza141/test/excel.xls --
>> +Fetched 0 Cnt: 1 http://bza141/test/excel.xls 200 OK
application/vnd.ms-excel 13824 parent:
http://bza141/test/excel.xls - Using HTML2 parser -
Adding:[1:swishdocpath(11)] 'http' Pos:1 Stuct:0x1 ( FILE )
Adding:[1:swishdocpath(11)] 'bza141' Pos:2 Stuct:0x1 ( FILE )
Adding:[1:swishdocpath(11)] 'test' Pos:3 Stuct:0x1 ( FILE )
Adding:[1:swishdocpath(11)] 'excel' Pos:4 Stuct:0x1 ( FILE )
Adding:[1:swishdocpath(11)] 'xls' Pos:5 Stuct:0x1 ( FILE )
Summary for: http://bza141/test/excel.xls
Total Bytes: 13,824 (13824.0/sec)
Total Docs: 1 (1.0/sec)
Unique URLs: 1 (1.0/sec)
Adding:[1:swishdefault(1)] 'ðïà' Pos:2 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'á' Pos:3 Stuct:0x9 ( BODY FILE )
(2 words)
swish@bza141:~> swish-e -S fs -c /home/swish/swish-e/conf/swish.conf.local
-T indexed_words
Indexing Data Source: "File-System"
Indexing "/home/swish/swish-e/test"
Checking dir "/home/swish/swish-e/test"...
excel.xls - Using TXT2 parser - Adding:[1:swishdocpath(11)] 'home'
Pos:1 Stuct:0x1 ( FILE )
Adding:[1:swishdocpath(11)] 'swish' Pos:2 Stuct:0x1 ( FILE )
Adding:[1:swishdocpath(11)] 'swish' Pos:3 Stuct:0x1 ( FILE )
Adding:[1:swishdocpath(11)] 'e' Pos:4 Stuct:0x1 ( FILE )
Adding:[1:swishdocpath(11)] 'test' Pos:5 Stuct:0x1 ( FILE )
Adding:[1:swishdocpath(11)] 'excel' Pos:6 Stuct:0x1 ( FILE )
Adding:[1:swishdocpath(11)] 'xls' Pos:7 Stuct:0x1 ( FILE )
Adding:[1:swishdefault(1)] 'ðï' Pos:1 Stuct:0x1 ( FILE )
Adding:[1:swishdefault(1)] 'à' Pos:2 Stuct:0x1 ( FILE )
Adding:[1:swishdefault(1)] 'á' Pos:3 Stuct:0x1 ( FILE )
Adding:[1:swishdefault(1)] 'excel' Pos:4 Stuct:0x1 ( FILE )
(4 words)
text.txt - Using DEFAULT (HTML2) parser - Adding:[2:swishdocpath(11)]
'home' Pos:1 Stuct:0x1 ( FILE )
Adding:[2:swishdocpath(11)] 'swish' Pos:2 Stuct:0x1 ( FILE )
Adding:[2:swishdocpath(11)] 'swish' Pos:3 Stuct:0x1 ( FILE )
Adding:[2:swishdocpath(11)] 'e' Pos:4 Stuct:0x1 ( FILE )
Adding:[2:swishdocpath(11)] 'test' Pos:5 Stuct:0x1 ( FILE )
Adding:[2:swishdocpath(11)] 'text' Pos:6 Stuct:0x1 ( FILE )
Adding:[2:swishdocpath(11)] 'txt' Pos:7 Stuct:0x1 ( FILE )
Adding:[2:swishdefault(1)] 'this' Pos:2 Stuct:0x9 ( BODY FILE )
Adding:[2:swishdefault(1)] 'is' Pos:3 Stuct:0x9 ( BODY FILE )
Adding:[2:swishdefault(1)] 'the' Pos:4 Stuct:0x9 ( BODY FILE )
Adding:[2:swishdefault(1)] 'text' Pos:5 Stuct:0x9 ( BODY FILE )
Adding:[2:swishdefault(1)] 'for' Pos:6 Stuct:0x9 ( BODY FILE )
Adding:[2:swishdefault(1)] 'test' Pos:7 Stuct:0x9 ( BODY FILE )
Adding:[2:swishdefault(1)] 'indexing' Pos:8 Stuct:0x9 ( BODY FILE )
Adding:[2:swishdefault(1)] 'with' Pos:9 Stuct:0x9 ( BODY FILE )
Adding:[2:swishdefault(1)] 'swish' Pos:10 Stuct:0x9 ( BODY FILE )
Adding:[2:swishdefault(1)] 'e' Pos:11 Stuct:0x9 ( BODY FILE )
(10 words)
hyper.htm - Using DEFAULT (HTML2) parser - Adding:[3:swishdocpath(11)]
'home' Pos:1 Stuct:0x1 ( FILE )
Adding:[3:swishdocpath(11)] 'swish' Pos:2 Stuct:0x1 ( FILE )
Adding:[3:swishdocpath(11)] 'swish' Pos:3 Stuct:0x1 ( FILE )
Adding:[3:swishdocpath(11)] 'e' Pos:4 Stuct:0x1 ( FILE )
Adding:[3:swishdocpath(11)] 'test' Pos:5 Stuct:0x1 ( FILE )
Adding:[3:swishdocpath(11)] 'hyper' Pos:6 Stuct:0x1 ( FILE )
Adding:[3:swishdocpath(11)] 'htm' Pos:7 Stuct:0x1 ( FILE )
Adding:[3:author(13)] 'leonard' Pos:2 Stuct:0x85 ( META HEAD FILE )
Adding:[3:author(13)] 'bucharow' Pos:3 Stuct:0x85 ( META HEAD FILE
)
Adding:[3:swishdefault(1)] 'mytext' Pos:6 Stuct:0x9 ( BODY FILE )
(3 words)
word.doc - Using TXT2 parser - Adding:[4:swishdocpath(11)] 'home'
Pos:1 Stuct:0x1 ( FILE )
Adding:[4:swishdocpath(11)] 'swish' Pos:2 Stuct:0x1 ( FILE )
Adding:[4:swishdocpath(11)] 'swish' Pos:3 Stuct:0x1 ( FILE )
Adding:[4:swishdocpath(11)] 'e' Pos:4 Stuct:0x1 ( FILE )
Adding:[4:swishdocpath(11)] 'test' Pos:5 Stuct:0x1 ( FILE )
Adding:[4:swishdocpath(11)] 'word' Pos:6 Stuct:0x1 ( FILE )
Adding:[4:swishdocpath(11)] 'doc' Pos:7 Stuct:0x1 ( FILE )
Adding:[4:swishdefault(1)] 'ðï' Pos:1 Stuct:0x1 ( FILE )
Adding:[4:swishdefault(1)] 'à' Pos:2 Stuct:0x1 ( FILE )
Adding:[4:swishdefault(1)] 'á' Pos:3 Stuct:0x1 ( FILE )
Adding:[4:swishdefault(1)] 'ä' Pos:4 Stuct:0x1 ( FILE )
Adding:[4:swishdefault(1)] '9' Pos:5 Stuct:0x1 ( FILE )
Adding:[4:swishdefault(1)] '0' Pos:6 Stuct:0x1 ( FILE )
(6 words)
acrobat.pdf - Using HTML2 parser - Adding:[5:swishdocpath(11)]
'home' Pos:1 Stuct:0x1 ( FILE )
Adding:[5:swishdocpath(11)] 'swish' Pos:2 Stuct:0x1 ( FILE )
Adding:[5:swishdocpath(11)] 'swish' Pos:3 Stuct:0x1 ( FILE )
Adding:[5:swishdocpath(11)] 'e' Pos:4 Stuct:0x1 ( FILE )
Adding:[5:swishdocpath(11)] 'test' Pos:5 Stuct:0x1 ( FILE )
Adding:[5:swishdocpath(11)] 'acrobat' Pos:6 Stuct:0x1 ( FILE )
Adding:[5:swishdocpath(11)] 'pdf' Pos:7 Stuct:0x1 ( FILE )
- Filtered: /home/swish/swish-e/test/acrobat.pdf
Adding:[5:author(13)] 'leonard' Pos:2 Stuct:0x85 ( META HEAD FILE )
Adding:[5:author(13)] 'bucharow' Pos:3 Stuct:0x85 ( META HEAD FILE
)
Adding:[5:swishdefault(1)] 'thu' Pos:7 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] 'aug' Pos:8 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] '21' Pos:9 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] '14' Pos:10 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] '27' Pos:11 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] '49' Pos:12 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] '2003' Pos:13 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] 'acrobat' Pos:16 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] 'pdfmaker' Pos:17 Stuct:0x5 ( HEAD FILE
)
Adding:[5:swishdefault(1)] '5' Pos:18 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] '0' Pos:19 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] 'für' Pos:20 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] 'word' Pos:21 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] 'no' Pos:24 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] '25707' Pos:27 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] 'bytes' Pos:28 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] 'thu' Pos:31 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] 'aug' Pos:32 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] '21' Pos:33 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] '14' Pos:34 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] '27' Pos:35 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] '53' Pos:36 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] '2003' Pos:37 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] 'yes' Pos:40 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] '595' Pos:43 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] 'x' Pos:44 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] '842' Pos:45 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] 'pts' Pos:46 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] 'a4' Pos:47 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] '1' Pos:50 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] '1' Pos:53 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] '3' Pos:54 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] 'acrobat' Pos:57 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] 'distiller' Pos:58 Stuct:0x5 ( HEAD FILE
)
Adding:[5:swishdefault(1)] '5' Pos:59 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] '0' Pos:60 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] 'windows' Pos:61 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] 'yes' Pos:64 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] 'eins' Pos:67 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] 'zwei' Pos:68 Stuct:0x5 ( HEAD FILE )
Adding:[5:swishdefault(1)] 'drei' Pos:69 Stuct:0x5 ( HEAD FILE )
(43 words)
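
A note on the garbage words indexed for excel.xls and word.doc above ('ðï', 'à', 'á'): they look like the first eight bytes of the OLE2 compound file header (D0 CF 11 E0 A1 B1 1A E1) read as Latin-1, which would suggest that the raw binary file reaches the parser without any filter being applied. A minimal Python sketch to check that guess follows. The function name rough_words and its word-splitting rule are my own simplification for illustration, not swish-e's actual tokenizer.

```python
import re

# First 8 bytes of an OLE2 compound file (.xls, .doc): the format's magic number.
OLE2_MAGIC = bytes([0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1])


def rough_words(data):
    """Decode bytes as Latin-1, lowercase them, and return runs of letters/digits.

    Hypothetical stand-in for an indexer's word splitting; the real
    swish-e rules (WordCharacters etc.) are configurable and differ.
    """
    text = data.decode("latin-1").lower()
    # [^\W_]+ matches one or more alphanumeric characters (no underscore).
    return re.findall(r"[^\W_]+", text)


print(rough_words(OLE2_MAGIC))  # prints ['ðï', 'à', 'á']
```

The three words match what the -S fs run reports for excel.xls and word.doc, so the later tokens ('excel', 'ä', '9', '0') are presumably further stray bytes from the same unfiltered files.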
Received on Thu Sep 4 11:33:13 2003