On Mon, Jun 06, 2005 at 11:37:02AM -0700, Thomas Nyman wrote:
> I'm still struggling a bit with my remote indexing. I can index the
> remote machine directory called arkiv, but when I do a search using
> that index I receive hits on the relevant documents and also on
> something called "index of arkiv". What that is I don't know.
Did you look at this?
Those are all links on the /arkiv/ page.
> Parsing of undecoded UTF-8 will give garbage when decoding entities
That's from HTML::Parser. I'm not really clear what it means -- or
how to fix it. The spider, IIRC, uses LWP, which uses HTML::Parser to
extract meta data from the <head> of the document. That can be
disabled, I believe.
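If you want to try disabling it, LWP::UserAgent has a parse_head
setting that stops it from running HTML::HeadParser (and therefore
HTML::Parser) over the <head> of fetched documents. A minimal sketch
-- the URL is just a placeholder, and I haven't tried this inside the
spider itself:

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# Don't parse the <head> of fetched documents; this is the code path
# that feeds raw (undecoded) bytes to HTML::Parser.
$ua->parse_head(0);

my $response = $ua->get('http://example.com/arkiv/');
print $response->decoded_content if $response->is_success;
```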
Here's that warning:
Parsing of undecoded UTF-8 will give garbage when decoding entities
(W) The first chunk parsed appears to contain undecoded UTF-8 and one
or more argspecs that decode entities are used for the callback.
The result of decoding will be a mix of encoded and decoded characters
for any entities that expand to characters with code above 127. This
is not a good thing.
The solution is to use Encode::decode_utf8() on the data before
feeding it to $p->parse(). For $p->parse_file(), pass a file that
has been opened in ":utf8" mode.
The parser can process raw undecoded UTF-8 sanely if utf8_mode
is enabled or if the "attr", "@attr" or "dtext" argspecs are avoided.
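A minimal sketch of the first fix -- decode the raw bytes to Perl
characters before they reach the parser. The markup here is made up
for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);
use HTML::Parser;

# Raw, undecoded UTF-8 bytes, as they would arrive over HTTP
# ("\xc3\xa9" is the UTF-8 encoding of e-acute).
my $raw = qq{<p title="caf\xc3\xa9 &eacute;">hi</p>};

my $p = HTML::Parser->new(
    api_version => 3,
    # The "attr" argspec decodes entities, which is what triggers
    # the warning when the input is still undecoded UTF-8.
    start_h => [ sub { print "title: $_[0]->{title}\n" }, 'attr' ],
);

# Decoding first means entity expansion and the rest of the text
# are all Perl characters -- no encoded/decoded mix.
$p->parse( decode_utf8($raw) );
$p->eof;
```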
The important thing is to see if you are really indexing what you need
to index. Index a single file that causes that error using the -T
indexed_words feature and make sure everything is indexed.
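Something like this -- the config and file names are placeholders for
whatever you are actually using:

```shell
# Index just the one problem file and dump every word as it is
# added to the index, so you can check nothing is being mangled.
swish-e -c swish.conf -i arkiv/problem-file.html -T indexed_words
```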
Received on Mon Jun 6 12:00:18 2005