Re: [swish-e] html parse problem?

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Mon Feb 05 2007 - 14:27:46 GMT

On Mon, Feb 05, 2007 at 12:25:24AM -0800, Jordan Hayes wrote:
> I upgraded to 2.4.5 (from 2.4.3) today, but none of my Mailman archives 
> will index anymore.
> 
> I've narrowed it down to this:
> 
> ./001946.html:3: error: htmlParseEntityRef: expecting ';'
>    <A HREF="mailto:joe%40blow.org?Subject=Foo&In-Reply-To=">Hi</A>
>                                                          ^

Yes, I've seen that.  It's an invalid entity according to libxml2: the
bare "&" in the query string starts an entity reference, and no ";"
follows it.

Are you sure that's what is causing your indexing to fail?   The error is
reported, but parsing continues.
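
If the warning itself is a nuisance, one option is to rewrite stray
ampersands as "&amp;" before the files reach the indexer.  The following
is only a rough sketch in Python, not anything swish-e provides: the
helper name escape_bare_ampersands is made up for illustration, and the
pattern assumes that anything of the form "&name;", "&#123;" or "&#x7B;"
is an intended reference and should be left alone.

```python
import re

# Matches "&" that does NOT begin a well-formed reference such as
# "&amp;", "&#64;" or "&#x40;".
_BARE_AMP = re.compile(
    r"&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#[xX][0-9A-Fa-f]+;)"
)


def escape_bare_ampersands(html_text):
    """Replace stray "&" with "&amp;" (hypothetical helper, not a swish-e API)."""
    return _BARE_AMP.sub("&amp;", html_text)


line = '<A HREF="mailto:joe%40blow.org?Subject=Foo&In-Reply-To=">Hi</A>'
print(escape_bare_ampersands(line))
# prints: <A HREF="mailto:joe%40blow.org?Subject=Foo&amp;In-Reply-To=">Hi</A>
```

Note that this works on the raw text and knows nothing about HTML
structure, so it would also rewrite ampersands inside <script> blocks.
For Mailman archive pages that is probably acceptable, but check before
running it over other content.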

$ cat test.html
<html>
<head>
<title>hello</title>
</head>
<body>
here is some text  <A HREF="mailto:joe%40blow.org?Subject=Foo&In-Reply-To=">Hi</A>
other text
</body>
</html>

$ swish-e -i test.html -T indexed_words
Indexing Data Source: "File-System"
Indexing "test.html"
    Adding:[1:swishdefault(1)]   'hello'   Pos:5  Stuct:0x7 ( HEAD TITLE FILE )
test.html:6: error: htmlParseEntityRef: expecting ';'
here is some text  <A HREF="mailto:joe%40blow.org?Subject=Foo&In-Reply-To=">Hi</
                                                                         ^
    Adding:[1:swishdefault(1)]   'here'   Pos:11  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'is'   Pos:12  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'some'   Pos:13  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'text'   Pos:14  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'hi'   Pos:15  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'other'   Pos:16  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'text'   Pos:17  Stuct:0x9 ( BODY FILE )
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 7 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
7 unique words indexed.
4 properties sorted.                                              
1 file indexed.  161 total bytes.  8 total words.
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!

-- 
Bill Moseley
moseley@hank.org

Received on Mon Feb 5 06:27:57 2007