[swish-e] Having trouble trying to ignore invalid tags in HTML docs

From: Kathleen Vignos <kathleen(at)>
Date: Thu Mar 01 2007 - 22:58:15 GMT

Hello there,

I've just installed Swish-e 2.4.5 on our server (FreeBSD OS).  I am trying
to index over 100,000 HTML documents.  Each of these documents begins with
extraneous tags ahead of the <html> tag (the ones named in the error output
at the bottom of this message, for example <TYPE> and <DESCRIPTION>).

I realize these are not valid HTML tags, but I didn't write these HTML docs.
Unfortunately I cannot change the original HTML docs to remove these tags
(or to insert <META> around them), so I'm looking for a way to get Swish-e
to ignore them.

I've spent some quality time with the Swish-e documentation and archives,
but everything seems to reference either ignoring meta tags (and these are
not meta tags) or ignoring specific tags while using the XML parser (but I
assume I need the HTML parser).

I've tried the following in the config file (swish.conf): first with
IgnoreWords by itself, then with IgnoreMetaTags by itself, then with
UndefinedMetaTags added.  I get the exact same results/errors each time.  I
also tried commenting out "DefaultContents HTML*" and got the same
results/errors (shown at the bottom of this message).

# Tell swish-e what to index
IndexDir /usr/local/apache/htdocs/documents/

# Only index HTML files
IndexOnly .htm .html

# Use the HTML parser
DefaultContents HTML*

# Ignore words list
IgnoreWords /usr/local/apache/swish-e-2.4.5/ignorewords.txt

# Ignore certain tags
UndefinedMetaTags ignore

I continue to get the following error messages:

/usr/local/apache/htdocs/documents/doc.htm:1: error: Tag document invalid
/usr/local/apache/htdocs/documents/doc.htm:2: error: Tag type invalid
<TYPE>Type text here
/usr/local/apache/htdocs/documents/doc.htm:3: error: Tag sequence invalid
/usr/local/apache/htdocs/documents/doc.htm:4: error: Tag filename invalid
/usr/local/apache/htdocs/documents/doc.htm:5: error: Tag description invalid
<DESCRIPTION>Description here
/usr/local/apache/htdocs/documents/doc.htm:6: error: Tag text invalid
/usr/local/apache/htdocs/documents/doc.htm:7: error: htmlParseStartTag: misplaced <html> tag
/usr/local/apache/htdocs/documents/doc.htm:7: error: htmlParseStartTag: misplaced <head> tag
/usr/local/apache/htdocs/documents/doc.htm:9: error: Unexpected end tag :
/usr/local/apache/htdocs/documents/doc.htm:10: error: htmlParseStartTag: misplaced <body> tag
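
All of the errors above point at lines that come before the <html> tag, or
at the <html>, <head>, and <body> tags that follow them.  As a rough
illustration of one possible workaround, here is a minimal Python sketch
that keeps only the span from the first <html> tag to the last </html> tag,
so the parser would never see the extra tags.  The function name
strip_leading_tags is hypothetical, and whether the result could be fed to
Swish-e (for example through an external filter program) is an assumption
that has not been tested here.

```python
import re

# Case-insensitive patterns for the real HTML document boundaries.
_HTML_OPEN = re.compile(r"<html\b", re.IGNORECASE)
_HTML_CLOSE = re.compile(r"</html\s*>", re.IGNORECASE)


def strip_leading_tags(text):
    """Return the part of text from the first <html> to the last </html>.

    If no <html> tag is found, the text is returned unchanged.  If there
    is no closing </html> after it, everything from <html> onward is kept.
    """
    start = _HTML_OPEN.search(text)
    if start is None:
        return text

    # Find the last closing </html> tag, if any.
    last_close = None
    for match in _HTML_CLOSE.finditer(text):
        last_close = match

    if last_close is not None and last_close.end() > start.start():
        stop = last_close.end()
    else:
        stop = len(text)
    return text[start.start():stop]
```

A wrapper script could read each file, pass its contents through this
function, and write the result to standard output.  The originals stay
untouched, which matters here because they cannot be edited.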

Thanks so much for any help you can provide!

Best Regards, 

