I'm running 2.4.0-pr3 on a FreeBSD 5 box. I'm new to SWISH-E (but really
like what I've seen so far), and am having trouble filtering the files I
want to index.
The files I want to include from the directory I'm indexing have names of
the form 99999.html: a series of digits followed by ".html". The directory
also holds files with numeric names ending in different "extension"
strings, and files whose names start with alphabetic characters.
My config file consists of one directive:
FileMatch filename contains ^[0-9]+\.html$
I'm calling the indexer with the command:
swish-e -c site.conf -f abilene.index -e -v 2 -i /www/abilene -T regex
If I grep the output for the strings 37154 and Real_Estate, I see the
following:
File[Rules|Match] filename match 37154.password =~ m[^[0-9]+\.html$] : nope
File[Rules|Match] filename match 37154.email =~ m[^[0-9]+\.html$] : nope
File[Rules|Match] filename match 37154.html =~ m[^[0-9]+\.html$] : matched
File[Rules|Match] filename match 37154.msgs =~ m[^[0-9]+\.html$] : nope
File[Rules|Match] filename match 37154.old =~ m[^[0-9]+\.html$] : nope
File[Rules|Match] filename match 37154.renew =~ m[^[0-9]+\.html$] : nope
File[Rules|Match] filename match 37154.views =~ m[^[0-9]+\.html$] : nope
File[Rules|Match] filename match Real_Estate.html =~ m[^[0-9]+\.html$] : nope
File[Rules|Match] filename match Real_Estate.links =~ m[^[0-9]+\.html$] : nope
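As a sanity check on the pattern itself, independent of SWISH-E, here is a minimal Python sketch. It assumes that Python's re module treats this simple pattern (anchors, a digit class, an escaped dot) the same way SWISH-E's regex library does; the name matching_names is made up for illustration and is not part of SWISH-E.

```python
import re

# Same pattern as in the FileMatch directive above.
PATTERN = re.compile(r"^[0-9]+\.html$")

# File names copied from the debug output above.
NAMES = [
    "37154.password", "37154.email", "37154.html", "37154.msgs",
    "37154.old", "37154.renew", "37154.views",
    "Real_Estate.html", "Real_Estate.links",
]


def matching_names(names):
    """Return the names that the pattern accepts (hypothetical helper)."""
    return [name for name in names if PATTERN.search(name)]


if __name__ == "__main__":
    # Print one verdict per name, mirroring the matched/nope debug output.
    for name in NAMES:
        verdict = "matched" if PATTERN.search(name) else "nope"
        print(f"{name}: {verdict}")
```

Under that assumption the only name reported as matched is 37154.html, which agrees with the debug output, so the pattern itself does not appear to be the problem.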
This leads me to think that the only file from this group added to the
index is 37154.html.
But when I run the search command
swish-e -c /usr/local/swish/site.conf -f /usr/local/swish/index/abilene.index -w "house"
my output is
# SWISH format: 2.4.0-pr3
# Search words: house
# Removed stopwords:
# Number of hits: 7
# Search time: 0.000 seconds
# Run time: 0.016 seconds
1000 /www/abilene/Real_Estate.html "Abilene, TX Real Estate Free Classifieds" 2271
431 /www/abilene/41094.html "Turn Your Yearly Income into Your Monthly Income!" 2453
431 /www/abilene/41341.html "Chile Land ForSale" 2649
431 /www/abilene/31133.html "Scotland for Great Family Vacations" 3047
431 /www/abilene/40605.html "antique iron pot" 2118
431 /www/abilene/37985.html "Clerical, word processors needed!" 2070
431 /www/abilene/40812.html "Scotland National Park" 2823
I'm wondering why Real_Estate.html, which shouldn't be in the index, is
coming up as a hit.
I also tried -w "a" as my search query, and got:
1000 /www/abilene/41089.gif "41089.gif" 46096
852 /www/abilene/41378.gif "41378.gif" 48146
841 /www/abilene/41377.jpg "41377.jpg" 81223
824 /www/abilene/41128.jpg "41128.jpg" 27751
818 /www/abilene/39308.jpg "39308.jpg" 54406
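To list every hit that should not be in the index, the search output can be filtered against the same pattern. The following is a rough Python sketch, not anything built into SWISH-E. It assumes each result sits on one line in the layout shown above (rank, path, quoted title, size) and that paths contain no spaces; unexpected_hits and SAMPLE are names made up for illustration.

```python
import os
import re

# Same pattern as in the FileMatch directive.
PATTERN = re.compile(r"^[0-9]+\.html$")

# Assumed result layout: rank, path (no spaces), quoted title, size.
RESULT_LINE = re.compile(r'^(\d+) (\S+) "(.*)" (\d+)$')


def unexpected_hits(lines):
    """Return paths from result lines whose base name fails the pattern."""
    bad = []
    for line in lines:
        m = RESULT_LINE.match(line.strip())
        if m is None:
            continue  # '#' header lines, blank lines, anything unparsable
        path = m.group(2)
        if not PATTERN.search(os.path.basename(path)):
            bad.append(path)
    return bad


# A few lines copied from the two result listings above.
SAMPLE = [
    "# Number of hits: 7",
    '1000 /www/abilene/Real_Estate.html "Abilene, TX Real Estate Free Classifieds" 2271',
    '431 /www/abilene/41341.html "Chile Land ForSale" 2649',
    '1000 /www/abilene/41089.gif "41089.gif" 46096',
]

if __name__ == "__main__":
    for path in unexpected_hits(SAMPLE):
        print(path)
```

On the sample lines this reports Real_Estate.html and 41089.gif and skips 41341.html, which is the mismatch described here: files that the pattern rejects still come back as hits.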
After trying all sorts of configuration options, I'm wondering if it's just
my ignorance, or perhaps there is something in 2.4.0-pr3 that is causing
the problem. Any suggestions would be greatly appreciated.
Received on Mon Oct 6 21:52:09 2003