On Tue, Jan 27, 2004 at 11:30:06PM -0800, Rob de Santos AFANA wrote:
> IndexDir spider.pl
>
> NoContents .gif .jpg .png .cgi .pl .log .jar .ico .js .class .log .sql
> .csv .dir .idx .dat
> When the indexing runs, swish-e attempts to read and interpret the jpeg
> files rather than simply adding the file path and name to the index as
> indicated in the NoContent directive.
Well, I was going to say that NoContents was not supported when using -S
prog, but then I remembered I added it.
Here's a quick test. (This uses a quickly hacked spider.pl that doesn't
skip binary files by default):
$ ./spider.pl default http://localhost/apache/finger.jpg > x
./spider.pl: Reading parameters from 'default'
Summary for: http://localhost/apache/finger.jpg
Total Bytes: 19,645 (19645.0/sec)
Total Docs: 1 (1.0/sec)
Unique URLs: 1 (1.0/sec)
$ swish-e -S prog -i stdin < x | grep 'words indexed'
2,217 unique words indexed.
$ cat c
NoContents .jpg
$ swish-e -S prog -i stdin -c c < x | grep 'words indexed'
5 unique words indexed.
$ swish-e -w not dkdk
# SWISH format: 2.4.1
# Search words: not dkdk
# Removed stopwords:
# Number of hits: 1
# Search time: 0.001 seconds
# Run time: 0.042 seconds
1000 http://localhost/apache/finger.jpg "finger.jpg" 19645
.
So it does work. But what I would recommend is that, in the
filter_content() function in your spider config file, you do this when
you see those extensions:
$$content_ref = $uri;
return 1;
That way you are just replacing the content with the path. It also avoids
sending all that data on to swish, which would just discard it.
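
A rough sketch of what that could look like in a spider config file is
below. Treat it as an illustration rather than a tested config: the
base_url and email values are placeholders, %skip_ext is a name made up
for this example, and the extension list is just a few of the ones from
your NoContents line. It assumes filter_content() is called with
( $uri, $server, $response, $content_ref ) and that $uri is a URI object.

```perl
# Hypothetical spider config sketch (placeholders, not a tested setup).

# Extensions whose content should be replaced by the path.
# %skip_ext is an illustrative name, not something spider.pl defines.
my %skip_ext = map { $_ => 1 } qw( gif jpg png ico );

sub filter_content {
    my ( $uri, $server, $response, $content_ref ) = @_;

    # Pull the extension off the URL path (assumes $uri is a URI object).
    my ($ext) = $uri->path =~ /\.([^.\/]+)$/;

    if ( defined $ext && $skip_ext{ lc $ext } ) {
        # Replace the document body with its URL so only the path
        # is passed on to swish. Quoted to force a plain string.
        $$content_ref = "$uri";
    }

    return 1;    # true means the document still gets indexed
}

@servers = (
    {
        base_url       => 'http://localhost/apache/',   # placeholder
        email          => 'swish@example.invalid',      # placeholder
        filter_content => \&filter_content,
    },
);

1;
```

Other file types pass through unchanged, since the function returns true
without touching the content.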
Hope that helps.
--
Bill Moseley
moseley@hank.org
Received on Tue Jan 27 23:52:00 2004