[swish-e] Having trouble trying to ignore invalid tags in HTML docs

From: Kathleen Vignos <kathleen(at)not-real.vinedesign.net>
Date: Thu Mar 01 2007 - 22:58:15 GMT

Hello there,

I've just installed Swish-e 2.4.5 on our server (FreeBSD).  I am trying
to index over 100,000 HTML documents.  Each document begins with
extraneous tags such as the following:
<DOCUMENT><FILENAME><DESCRIPTION><SEQUENCE>

I realize these are not valid HTML tags, but I didn't write these HTML docs.
Unfortunately I cannot change the original HTML docs to remove these tags
(or to insert <META> around them), so I'm looking for a way to get Swish-e
to ignore them.

I've spent some quality time with the Swish-e documentation and archives,
but everything seems to reference either ignoring meta tags (and these are
not meta tags) or ignoring specific tags while using the XML parser (but I
assume I need the HTML parser).

I've tried the following in the config file (swish.conf): first with
IgnoreWords by itself, then with IgnoreMetaTags by itself, then with
UndefinedMetaTags added.  I get exactly the same errors each time.  I also
tried commenting out "DefaultContents HTML*" and still got the same errors
(shown at the bottom of this message).

# Tell swish-e what to index
IndexDir /usr/local/apache/htdocs/documents/

# Only index HTML files
IndexOnly .htm .html

# Use the HTML parser
DefaultContents HTML*

# Ignore words list
IgnoreWords /usr/local/apache/swish-e-2.4.5/ignorewords.txt

# Ignore certain tags
IgnoreMetaTags DOCUMENT FILENAME DESCRIPTION SEQUENCE
UndefinedMetaTags ignore

I continue to get the following error messages:


/usr/local/apache/htdocs/documents/doc.htm:1: error: Tag document invalid
<DOCUMENT>
         ^
/usr/local/apache/htdocs/documents/doc.htm:2: error: Tag type invalid
<TYPE>Type text here
     ^
/usr/local/apache/htdocs/documents/doc.htm:3: error: Tag sequence invalid
<SEQUENCE>4
         ^
/usr/local/apache/htdocs/documents/doc.htm:4: error: Tag filename invalid
<FILENAME>doc.htm
         ^
/usr/local/apache/htdocs/documents/doc.htm:5: error: Tag description invalid
<DESCRIPTION>Description here
            ^
/usr/local/apache/htdocs/documents/doc.htm:6: error: Tag text invalid
<TEXT>
     ^
/usr/local/apache/htdocs/documents/doc.htm:7: error: htmlParseStartTag: misplaced <html> tag
<HTML><HEAD>
     ^
/usr/local/apache/htdocs/documents/doc.htm:7: error: htmlParseStartTag: misplaced <head> tag
<HTML><HEAD>
           ^
/usr/local/apache/htdocs/documents/doc.htm:9: error: Unexpected end tag : head
</HEAD>
       ^
/usr/local/apache/htdocs/documents/doc.htm:10: error: htmlParseStartTag: misplaced <body> tag
 <BODY BGCOLOR="WHITE">
      ^

Thanks so much for any help you can provide!

Best Regards, 
Kathleen


Received on Thu Mar 1 17:55:00 2007
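
The error output above suggests the parser objects to everything that precedes the first <HTML> tag. One possible workaround, offered only as a sketch and not as a confirmed Swish-e solution, is to strip that leading wrapper from a copy of the text before it reaches the HTML parser, leaving the original files untouched. The function name strip_wrapper is hypothetical, the sample text is reconstructed from the error messages (the <TITLE> line is made up), and any trailing wrapper tags at the end of a file are not handled. How such a step would be hooked into indexing is not covered here.

```python
import re

# Matches the first <html ...> start tag, case-insensitively.
HTML_START = re.compile(r"<html\b", re.IGNORECASE)


def strip_wrapper(text: str) -> str:
    """Return text from the first <html> tag onward.

    If no <html> tag is found, the text is returned unchanged.
    (Hypothetical helper; not part of Swish-e.)
    """
    match = HTML_START.search(text)
    return text[match.start():] if match else text


# Sample reconstructed from the error output; the <TITLE> line is invented.
sample = (
    "<DOCUMENT>\n"
    "<TYPE>Type text here\n"
    "<SEQUENCE>4\n"
    "<FILENAME>doc.htm\n"
    "<DESCRIPTION>Description here\n"
    "<TEXT>\n"
    "<HTML><HEAD>\n"
    "<TITLE>Example</TITLE>\n"
    "</HEAD>\n"
    ' <BODY BGCOLOR="WHITE">\n'
)

cleaned = strip_wrapper(sample)
print(cleaned)
```

Because the function only slices the string in memory, it can be run against a stream or a temporary copy, which fits the constraint that the original HTML documents cannot be modified.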