Re: PropertyNames not working in 2.1-dev-24

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Nov 16 2001 - 18:54:45 GMT

Try something simple first.

> cat c
propertynames foo description


> cat 1.html
<meta name="foo" content="bar">
hello
  <META NAME="DESCRIPTION"
  CONTENT="UK based 
           Wine Shop">

> ./swish-e -c c -i 1.html -T indexed_words properties
Indexing Data Source: "File-System"
Indexing "1.html"
    Adding:[swishdefault:1]   'bar'   Pos:2  Stuct:0x81 ( META FILE )
    Adding:[swishdefault:1]   'hello'   Pos:4  Stuct:0x1 ( FILE )
    Adding:[swishdefault:1]   'uk'   Pos:6  Stuct:0x81 ( META FILE )
    Adding:[swishdefault:1]   'based'   Pos:7  Stuct:0x81 ( META FILE )
    Adding:[swishdefault:1]   'wine'   Pos:8  Stuct:0x81 ( META FILE )
    Adding:[swishdefault:1]   'shop'   Pos:9  Stuct:0x81 ( META FILE )
          swishdocpath: 6 (  6) S: "1.html"
          swishdocsize: 8 (  4) N: "0000000000110"
     swishlastmodified: 9 (  4) D: "2001-11-16 10:34:12"
                   foo:10 (  3) S: "bar"
           description:11 ( 30) S: "UK based             Wine Shop"
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 6 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
6 unique words indexed.
6 properties sorted.                                              
1 file indexed.  110 total bytes.
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!


> ./swish-e -w not dkdk  -p foo description
# SWISH format: 2.1-dev-24
# Search words: not dkdk
# Number of hits: 1
# Search time: 0.000 seconds
# Run time: 0.005 seconds
1000 1.html "1.html" 110 "bar" "UK based             Wine Shop"
.

I've always wondered what, if anything, should be done with that extra
white space.
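One option, if the extra white space is unwanted, is to collapse it after the fact in whatever script reads the search results. The following is only a sketch of that idea in Python, not something swish-e does itself; the function name collapse_whitespace is made up for illustration.

```python
def collapse_whitespace(value: str) -> str:
    """Collapse each run of spaces, tabs, or newlines into one space.

    Hypothetical helper: post-processes a property value as returned
    by a search; it does not change what is stored in the index.
    """
    # str.split() with no arguments splits on any whitespace run and
    # drops leading/trailing whitespace, so joining with a single
    # space normalizes the value.
    return " ".join(value.split())


if __name__ == "__main__":
    # The description property exactly as it appears in the output above.
    raw = "UK based             Wine Shop"
    print(collapse_whitespace(raw))  # prints: UK based Wine Shop
```

Whether the index itself should store the collapsed form is a separate question; this only tidies the value for display.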




>TmpDir                  /tmp/
>SpiderDirectory         ./
>Delay                   0
>MaxDepth                1

Use -S prog with spider.pl for a faster spider.


>IgnoreLimit             80 1000

Don't use IgnoreLimit (see the 2.1-dev docs) more than once ;)

>IndexComments           0
>IndexContents           HTML    .lml .htm .html

If you are parsing HTML, consider using the libxml2 parser.  It's more
accurate.

>IgnoreWords             File: swish-stopwords.txt

I'm starting to think stopwords are bad, in general.  My list is about five
words long.


>IndexDir                http://www.bbr.com/gb.lml
>IndexFile               index.tmp





Bill Moseley
mailto:moseley@hank.org
Received on Fri Nov 16 18:56:37 2001