[swish-e] "-S prog" mashing up words in HTML files

From: Matthew Stanislawski <stnslwsk(at)not-real.uiuc.edu>
Date: Tue Mar 20 2007 - 21:15:32 GMT

Hi,

I'm having a strange problem with particular words being missed when
indexing an HTML document.  Our documents are retrieved from a database
using a Perl script and fed to "swish-e -S prog -i stdin" (in a single
stream, with documents separated by Path-Name lines, etc.).  In this
example, the offending words are contained in a <table> written out in
one very long line (blame our CMS for that).  It seems that swish-e, in
stripping the HTML tags, ends up mashing together words that appear on
opposite sides of the string "</td><td>".  That is, in a line containing
this snippet:

...loading dock)<br/></td></tr><tr><td>&nbsp;H5</td><td>DCL Hallway...

neither "h5" nor "dcl" shows up as an indexed word; "h5dcl" does
instead.  Strangely, if I save the document source to a file and index
it with "swish-e -i file.html", "h5" and "dcl" are correctly indexed as
separate words.  I've made sure that our Perl script isn't doing
anything funny to the HTML.  I've also tried increasing MaxWordLimit (in
case those terribly long lines were the culprit).
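
To make the symptom concrete, here is a minimal Python sketch of the
effect described above.  It is not swish-e's actual parsing code; the
helper names strip_tags and index_words are hypothetical, and the word
splitting is a rough stand-in for the indexer's tokenizer.  It only
shows that removing tags without leaving whitespace behind glues the
neighbouring cell contents together, while substituting a space keeps
them apart:

```python
import html
import re

# The snippet quoted above, without the surrounding "..." elisions.
snippet = "loading dock)<br/></td></tr><tr><td>&nbsp;H5</td><td>DCL Hallway"


def strip_tags(markup, replacement):
    """Replace every <...> tag with `replacement` (hypothetical helper)."""
    return re.sub(r"<[^>]+>", replacement, markup)


def index_words(text):
    """Rough stand-in for word splitting: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", html.unescape(text).lower())


# Tags removed with nothing in their place: "H5" and "DCL" fuse.
print(index_words(strip_tags(snippet, "")))

# Tags replaced by a space: the two cells stay separate words.
print(index_words(strip_tags(snippet, " ")))
```

The first call yields "h5dcl" as one word, as in the "-S prog" trace;
the second yields "h5" and "dcl" separately, as in the "-i file.html"
trace.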

Here's my swish.cfg, with some MetaName* and PropertyName* directives 
stripped out for brevity:

HTMLLinksMetaName links
ImageLinksMetaName images
IndexAltTagMetaName as-text
FuzzyIndexingMode Stemming_en2
IgnoreTotalWordCountWhenRanking yes
TranslateCharacters :ascii7:
MaxWordLimit 50

Here are snippets from "-T PARSED_WORDS INDEXED_WORDS" when indexing 
using "-S prog -i stdin" and "-i file.html", respectively:

---
./spew_documents.pl | swish-e -f index.file -S prog -i stdin -c ~/etc/swish.cfg -T PARSED_WORDS INDEXED_WORDS | less

White-space found word 'dock)H5DCL'
     Adding:[120:swishdefault(1)]   'dock'   Pos:289  Stuct:0x1 ( FILE )
     Adding:[120:details(13)]   'dock'   Pos:289  Stuct:0x1 ( FILE )
     Adding:[120:swishdefault(1)]   'h5dcl'   Pos:290  Stuct:0x1 ( FILE )
     Adding:[120:details(13)]   'h5dcl'   Pos:290  Stuct:0x1 ( FILE )
---
---
swish-e -i file.html -c ~/etc/swish.cfg -T PARSED_WORDS INDEXED_WORDS | less

White-space found word 'dock)'
     Adding:[1:swishdefault(1)]   'dock'   Pos:508  Stuct:0x89 ( META BODY FILE )
     Adding:[1:details(13)]   'dock'   Pos:508  Stuct:0x89 ( META BODY FILE )
White-space found word '<A0>H5'
     Adding:[1:swishdefault(1)]   'h5'   Pos:513  Stuct:0x89 ( META BODY FILE )
     Adding:[1:details(13)]   'h5'   Pos:513  Stuct:0x89 ( META BODY FILE )
White-space found word 'DCL'
     Adding:[1:swishdefault(1)]   'dcl'   Pos:516  Stuct:0x89 ( META BODY FILE )
     Adding:[1:details(13)]   'dcl'   Pos:516  Stuct:0x89 ( META BODY FILE )
---
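
One visible difference between the two traces is the structure field:
0x1 ( FILE ) under "-S prog" versus 0x89 ( META BODY FILE ) for
file.html.  That may mean the two runs are not using the same parser,
though this is a guess and not a confirmed diagnosis.  For reference,
below is a minimal Python sketch of one document in the "-S prog"
stream with the parser requested explicitly through a Document-Type
header.  The function name and the example path are hypothetical, and
the real feeder here is a Perl script whose output is not shown:

```python
import sys


def format_prog_document(path_name, html_text, document_type="HTML*"):
    """Build one document for a "-S prog" stream (hypothetical helper).

    Headers are followed by a blank line and then the content;
    Content-Length is the size of the content in bytes.
    """
    body = html_text.encode("utf-8")
    headers = (
        f"Path-Name: {path_name}\n"
        f"Content-Length: {len(body)}\n"
        f"Document-Type: {document_type}\n"
        "\n"
    )
    return headers.encode("utf-8") + body


if __name__ == "__main__":
    # Example path and content are made up for illustration only.
    doc = format_prog_document("rooms/120.html", "<td>H5</td><td>DCL</td>")
    sys.stdout.buffer.write(doc)
```

Re-running the "-S prog" command with "-T PARSED_WORDS INDEXED_WORDS"
after such a change would show whether the structure flags, and the
word splitting, then match the file.html run.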


Any ideas?

Thanks,
Matt Stanislawski
Received on Tue Mar 20 17:15:40 2007