Re: Description and Title not being parsed properly

From: Bill Moseley <moseley(at)>
Date: Thu Jan 17 2002 - 03:18:17 GMT
To continue: 

Here's the same thing on Windows 98:

E:\Program Files\SWISH-E>cat f
IndexContents HTML2 .htm .html .shtml
StoreDescription HTML2 <body> 200
MaxDepth 1
Delay 0
DefaultContents HTML2
IgnoreMetaTags style script

That last line:

   IgnoreMetaTags style script

makes the HTML2 parser skip the <script> section.

E:\Program Files\SWISH-E>swish-e -c f -S http
Indexing Data Source: "HTTP-Crawler"
Indexing ""
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 864 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
864 unique words indexed.
5 properties sorted.
1 file indexed.  76664 total bytes.  1653 total words.
Elapsed time: 00:00:05 CPU time: 00:00:05
Indexing done!

E:\Program Files\SWISH-E>swish-e -w you -x "Title=%t\n\nSummary\n<swishdescription>\n"
# SWISH format: 2.1-dev-25
# Search words: you
# Number of hits: 1
# Search time: 0.000 seconds
# Run time: 0.110 seconds
Title=Baptist Health - Health & Wellness

Condition Super Centers Diseases, Conditions and Injuries These comprehensive centers provide detailed information on some of the most common health conditions.
 This quick-reference guide offers the b

Hmm, I thought it knew to truncate on white space instead of in the middle of a word...
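
For reference, here is roughly what I mean by truncating on white space, as a small Python sketch. This is not swish-e's actual code (swish-e is written in C); the function name `truncate_on_whitespace` and the sample string are made up for illustration, and `limit` plays the role of the 200 in the StoreDescription line above.

```python
def truncate_on_whitespace(text, limit):
    """Cut text to at most `limit` characters without splitting a word.

    Hypothetical helper for illustration only; not taken from swish-e.
    """
    if len(text) <= limit:
        return text

    cut = text[:limit]

    # If the first dropped character is whitespace, the cut already
    # falls on a word boundary.
    if text[limit].isspace():
        return cut.rstrip()

    # Otherwise back up to the last whitespace inside the cut.
    last_ws = max((i for i, c in enumerate(cut) if c.isspace()), default=-1)
    if last_ws == -1:
        # One long word with no whitespace: fall back to a hard cut.
        return cut
    return cut[:last_ws].rstrip()


print(truncate_on_whitespace("alpha beta gamma", 8))
```

With a limit of 8 the hard cut would be "alpha be"; the sketch backs up and returns "alpha" instead.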

BTW - you had this in your config:

   FileFilter .pdf prog-bin/pdf2html

My guess is that won't work on Windows like that.  See the docs for more info.

Hope this helps.

Hey David: What's the deal with this?  Is the percent sign a shell metacharacter in Windows?


> ./swish-e -w you -x '%t\n%t\n' -H 0
Baptist Health - Health & Wellness
Baptist Health - Health & Wellness


E:\Program Files\SWISH-E>swish-e -w you -x "%t\n" -H0
Baptist Health - Health & Wellness

E:\Program Files\SWISH-E>swish-e -w you -x "%t\n%t" -H0

E:\Program Files\SWISH-E>swish-e -w you -x "%%t\n%%t" -H0
Baptist Health - Health & Wellness
Baptist Health - Health & Wellness
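
My guess, and it is only a guess until someone who knows Windows confirms it: the Windows shell expands %NAME% as an environment variable, so in "%t\n%t" everything between the first two percent signs gets treated as a (nonexistent) variable name and dropped, while "%%" collapses to a single literal percent sign. Here is a toy Python model of that hypothesis. `expand_percent` is a made-up helper, not anything from swish-e or from Windows, and the real shell parser may well behave differently.

```python
def expand_percent(arg, env=None):
    """Toy model of %NAME% expansion (an assumption, not the real shell).

    Rules assumed here:
      - "%%" becomes a single literal "%"
      - "%name%" is replaced by env[name], or "" if name is not set
      - a lone "%" with no closing "%" is passed through unchanged
    """
    env = env or {}
    out = []
    i = 0
    while i < len(arg):
        if arg[i] != "%":
            out.append(arg[i])
            i += 1
        elif arg.startswith("%%", i):
            out.append("%")
            i += 2
        else:
            end = arg.find("%", i + 1)
            if end == -1:
                # No closing percent sign: keep the character as-is.
                out.append("%")
                i += 1
            else:
                out.append(env.get(arg[i + 1:end], ""))
                i = end + 1
    return "".join(out)


# Raw strings keep the backslash-n as two literal characters,
# the way they appear on the command line.
for fmt in (r"%t\n", r"%t\n%t", r"%%t\n%%t"):
    print(repr(fmt), "->", repr(expand_percent(fmt)))
```

Under this model the first and third format strings reach swish-e with their %t intact, and the second one loses it, which at least lines up with the three runs above.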

Bill Moseley
Received on Thu Jan 17 03:18:56 2002