On Tue, Jan 27, 2004 at 11:30:06PM -0800, Rob de Santos AFANA wrote:
> IndexDir spider.pl
> NoContents .gif .jpg .png .cgi .pl .log .jar .ico .js .class .log .sql
> .csv .dir .idx .dat
> When the indexing runs, swish-e attempts to read and interpret the jpeg
> files rather than simply adding the file path and name to the index as
> indicated in the NoContents directive.
Well, I was going to say that NoContents was not supported when using -S
prog, but then I remembered I added it.
(This is a quickly hacked spider.pl that doesn't skip binary files by
default.)
$ ./spider.pl default http://localhost/apache/finger.jpg > x
./spider.pl: Reading parameters from 'default'
Summary for: http://localhost/apache/finger.jpg
Total Bytes: 19,645 (19645.0/sec)
Total Docs: 1 (1.0/sec)
Unique URLs: 1 (1.0/sec)
$ swish-e -S prog -i stdin < x | grep 'words indexed'
2,217 unique words indexed.
$ cat c
$ swish-e -S prog -i stdin -c c < x | grep 'words indexed'
5 unique words indexed.
$ swish-e -w not dkdk
# SWISH format: 2.4.1
# Search words: not dkdk
# Removed stopwords:
# Number of hits: 1
# Search time: 0.001 seconds
# Run time: 0.042 seconds
1000 http://localhost/apache/finger.jpg "finger.jpg" 19645
So it does work. But what I would recommend is that, in the
filter_content() function of your spider config file, when you see
those extensions you do
$$content_ref = $uri;
That way you are just replacing the content with the path. It also
avoids sending all that data on to swish, where it would just be
discarded.
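Something along these lines in the spider config file is what I mean. Treat it as a rough sketch, not tested against your setup: the sub name path_only, the extension list, the email address and the base_url are made up for illustration, and it assumes the callback is passed ( $uri, $server, $response, $content_ref ) with $uri being a URI object, as I recall from the spider.pl docs.

```perl
# Hypothetical list of extensions whose content should not reach swish-e.
my $skip_ext = qr/\.(?:gif|jpe?g|png|ico|jar|class)$/i;

# Assumed filter_content callback signature; check the spider.pl docs
# for the version you are running.
sub path_only {
    my ( $uri, $server, $response, $content_ref ) = @_;

    # For matching extensions, replace the fetched body with the URL itself,
    # so only the path gets indexed and the binary data is never sent on.
    $$content_ref = "$uri" if $uri->path =~ $skip_ext;

    return 1;    # true: still pass the (possibly replaced) document to swish-e
}

# Illustrative server entry; adjust base_url and email for your site.
@servers = (
    {
        base_url       => 'http://localhost/apache/',
        email          => 'swish@example.invalid',
        filter_content => \&path_only,
    },
);

1;
```

With that in place you would run spider.pl with this config file instead of "default", and the NoContents line in the swish-e config becomes optional for those extensions.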
Hope that helps.
Received on Tue Jan 27 23:52:00 2004