On Tue, Oct 25, 2005 at 10:11:54AM -0700, J. David Boyd wrote:
> > Perhaps it's indexed under a different metaname? We can only guess
> > since you are not providing any examples that we can reproduce.
> Hmm, how would I tell?
Well, this is what I'd do:
Hmm, I can't find "routable", but I'm sure it's in the file Foo.html
that I'm indexing. OK, let me step back a bit. First, I'll index
just that one file and look at what words are indexed:
swish-e -i Foo.html -T indexed_words | grep routable
Adding:[1:swishdefault(1)] 'routable' Pos:40 Stuct:0x9 ( BODY FILE )
OK, so I see it's being indexed as "swishdefault". (But if it wasn't,
then I'd go in and start hacking away at Foo.html to see why, and
also enable ParserWarnLevel to see if the parser will find anything.)
Now, index the same file using my config and look for routable:
swish-e -i Foo.html -T indexed_words -c config | grep routable
If it does show up, you will know which metaname it is under. If it
doesn't show up, then you know something in the config is keeping it
out. Start commenting out lines in the config until you see it show up.
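The comparison above can be wrapped in a small helper. This is only a
sketch: the name metanames_for is made up here, and it assumes the
-T indexed_words output keeps the "Adding:[N:metaname(N)] 'word'" shape
shown in the sample line above.

```shell
# Hypothetical helper: read `-T indexed_words` output on stdin and print
# each distinct metaname under which WORD was added.
# Assumes lines look like:
#   Adding:[1:swishdefault(1)] 'routable' Pos:40 ...
metanames_for() {
    word=$1
    # Keep only lines for this word, then pull out the text between
    # the first ':' after the file number and the following '('.
    grep -F "'$word'" \
        | sed -n 's/^Adding:\[[0-9]*:\([^(]*\)(.*/\1/p' \
        | sort -u
}

# Usage (Foo.html and config are the placeholder names used above):
#   swish-e -i Foo.html -T indexed_words           | metanames_for routable
#   swish-e -i Foo.html -T indexed_words -c config | metanames_for routable
```

If the two runs print different metanames, or the second prints nothing,
the config is what changed where the word ends up.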
> I'm in the ~/share/doc/swish-e/examples/conf directory,
> and I'm running
> swish-e -S prog -c example9.config
Those examples are really there to walk you through various ways of
doing things with swish.
That will work, but I'd probably use a different method for parsing
the pdf files. spider.pl will automatically filter for you by
default. Or there's a program called DirTree.pl that walks a
directory tree (instead of using a web server).
You can also use swish-filter-test to index one file for testing:
$ swish-filter-test -content -headers -quiet test.pdf | swish-e -S prog -i stdin
Indexing Data Source: "External-Program"
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 1,547 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
1,547 unique words indexed.
4 properties sorted.
1 file indexed. 43,751 total bytes. 6,870 total words.
Elapsed time: 00:00:02 CPU time: 00:00:00
$ swish-filter-test -content -headers -quiet test.pdf | swish-e -S prog -i stdin -v0 -T indexed_words | grep recommended
Adding:[1:swishdefault(1)] 'recommended' Pos:1809 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'recommended' Pos:2887 Stuct:0x9 ( BODY FILE )
> /usr/home/tsc0/public_html/add/MOD_0/AAA-MOD0.TBL.pdf - Using XML parser
> - !!!Adding automatic MetaName 'all' found in file
I find little use for "auto" metanames. But, that's likely your
> Warning: XML parse error in file
> '/usr/home/tsc0/public_html/add/MOD_0/AAA-MOD0.TBL.pdf' line 18. Error:
> not well-formed
That's a bit odd. Maybe something isn't being escaped correctly, or
there's an odd encoding error. Something to look at later.
> Then, like I said,
> swish-e -T index_all_words shows me all the words I am looking for, but
> I can't get one with the "-w".
> I thought that a 'swish-e -w WORD' would be the least restrictive kind
> of search...
That would be an incorrect assumption. Swish doesn't work that way.
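As far as I know, a bare -w WORD only looks in the default metaname
(swishdefault), so a word indexed under another metaname, such as the
automatic "all" in the output quoted above, has to be asked for as
all=WORD. A tiny sketch of building such a query; the helper name
meta_query is made up, and the index file name in the usage line is
an assumption.

```shell
# Hypothetical helper: build a swish-e query limited to one metaname.
# Assumes the usual "metaname=word" query form. With no metaname given,
# the bare word is printed, which searches the default metaname only.
meta_query() {
    word=$1
    meta=${2:-}
    if [ -n "$meta" ]; then
        printf '%s=%s\n' "$meta" "$word"
    else
        printf '%s\n' "$word"
    fi
}

# Usage (index.swish-e is assumed to be the index you built):
#   swish-e -f index.swish-e -w "$(meta_query routable all)"
```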
Received on Tue Oct 25 10:42:36 2005