Re: [swish-e] multiple Warnings: 'could not be encoded to charset 'ISO-8859-1'

From: Dr Michael Daly
Date: Thu, 15 Mar 2012 12:51:19 +1100 (EST)
Hi
The warning seems to be related to the *directory name* itself. I
determined this as follows:
1. In web_1.conf I replaced the directive
 SwishProgParameters default http://localhost:104

with one that lists two URLs: the directory, and a specific file inside
that directory:
SwishProgParameters default http://localhost:104/_docs/test3
http://localhost:104/_docs/test3/Reception-duties.doc

2. There is a second file in /test3: a copy of Reception-duties.doc
named Reception-duties-2.doc

3. This is the output (also, is it normal for swish-e to report that it
is indexing "Data Source" and "spider.pl"?):
swish-e -S prog -c /share/MD0_DATA/swish-e-files/swish-e-conf/web_1.conf
Indexing Data Source: "External-Program"
Indexing "spider.pl"
External Program found: /opt/lib/swish-e/spider.pl
Missing argument in sprintf at /opt/lib/swish-e/spider.pl line 38.
Missing argument in sprintf at /opt/lib/swish-e/spider.pl line 38.
/opt/lib/swish-e/spider.pl: Reading parameters from 'default'

Summary for: http://localhost:104/_docs/test3/Reception-duties.doc
             Connection: Close:     1  (1.0/sec)
                   Total Bytes: 1,217  (1217.0/sec)
                    Total Docs:     1  (1.0/sec)
                   Unique URLs:     1  (1.0/sec)
application/msword->text/plain:     1  (1.0/sec)
Warning: document 'http://localhost:104/_docs/test3/' could not be encoded to charset 'ISO-8859-1'

Summary for: http://localhost:104/_docs/test3
             Connection: Close:     1  (1.0/sec)
        Connection: Keep-Alive:     2  (2.0/sec)
                    Duplicates:     1  (1.0/sec)
            Location Redirects:     1  (1.0/sec)
                Off-site links:     5  (5.0/sec)
                   Total Bytes: 2,307  (2307.0/sec)
                    Total Docs:     3  (3.0/sec)
                   Unique URLs:     4  (4.0/sec)
application/msword->text/plain:     1  (1.0/sec)
                     text/html:     2  (2.0/sec)
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 145 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
145 unique words indexed.
5 properties sorted.
4 files indexed.  3,524 total bytes.  450 total words.
Elapsed time: 00:00:03 CPU time: 00:00:00
Indexing done!

4. Search works:
swish-e -w opening
# SWISH format: 2.4.7
# Search words: opening
# Removed stopwords:
# Number of hits: 2
# Search time: 0.001 seconds
# Run time: 0.064 seconds
1000 http://localhost:104/_docs/test3/Reception-duties-2.doc "Reception-duties-2.doc" 1217
1000 http://localhost:104/_docs/test3/Reception-duties.doc "Reception-duties.doc" 1217


Thanks





Dr Michael Daly wrote on 3/14/12 7:40 AM:
> Maybe this is related to my previous problem, maybe not:

The .xls file errors are probably related.

> Whereby the content of web_1.conf is:
>  IndexDir spider.pl
>  SwishProgParameters default http://localhost:104
>  StoreDescription TXT 200
>  StoreDescription HTML <body> 200
>
> invoking this via:
> # swish-e -S prog -c
> /share/MD0_DATA/swish-e-files/swish-e-conf/web_1.conf
>
> outputs:
> Indexing Data Source: "External-Program"
> Indexing "spider.pl"
> External Program found: /opt/lib/swish-e/spider.pl
> Missing argument in sprintf at /opt/lib/swish-e/spider.pl line 38.
> Missing argument in sprintf at /opt/lib/swish-e/spider.pl line 38.
> /opt/lib/swish-e/spider.pl: Reading parameters from 'default'
> Warning: document 'http://localhost:104' could not be encoded to charset
> 'ISO-8859-1'

Break it down to one file and see if you can isolate the problem. E.g.,
if you can fetch http://localhost:104, write its contents to a file, and
then index that file directly with swish-e, then you know the problem is
in the spider config. If you can't index the file with swish-e, then you
know the problem is in your swish-e config and/or your document.
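
The fetch-and-save half of that test could be sketched as below. This is
only an illustration, assuming Python 3 is available on the machine; the
function name fetch_to_file and the output filename are hypothetical and
do not come from swish-e or spider.pl. It writes the raw bytes without
decoding them, so the saved file is what the server actually sent.

```python
import urllib.request
from pathlib import Path


def fetch_to_file(url: str, dest: str) -> int:
    """Download url and write the raw, undecoded bytes to dest.

    Returns the number of bytes written. No charset conversion is done,
    so the file on disk matches the response body byte for byte.
    """
    with urllib.request.urlopen(url) as response:
        raw = response.read()
    Path(dest).write_bytes(raw)
    return len(raw)


# Example call (URL from this thread, filename is an assumption):
#   fetch_to_file("http://localhost:104", "front-page.html")
```

The saved file can then be indexed directly, for example by passing it
to swish-e with -i and a config that does not use the prog source. If
the warning still appears, the spider is probably not the cause.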

Encoding problems are common. Make sure your content is either
ISO-8859-1 (or some other single-byte encoding) or UTF-8, and be
prepared for swish-e to convert it to ISO-8859-1 internally when
indexing.
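
One way to see which characters in a document fall outside ISO-8859-1 is
a small check like the following. It is a sketch under assumptions, not
swish-e's own conversion logic: it assumes the input is UTF-8, and the
helper names find_unencodable and check_bytes are made up for this
example.

```python
from typing import List, Optional, Tuple


def find_unencodable(text: str, charset: str = "iso-8859-1") -> List[Tuple[int, str]]:
    """Return (offset, character) pairs that cannot be encoded to charset."""
    problems = []
    for offset, char in enumerate(text):
        try:
            char.encode(charset)
        except UnicodeEncodeError:
            problems.append((offset, char))
    return problems


def check_bytes(raw: bytes, charset: str = "iso-8859-1") -> Optional[List[Tuple[int, str]]]:
    """Decode raw as UTF-8 and list the characters outside charset.

    Returns None when raw is not valid UTF-8, in which case the input
    is in some other encoding and this check does not apply.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return find_unencodable(text, charset)
```

An empty list means every character has an ISO-8859-1 equivalent.
Typical offenders in converted Word documents are curly quotes and long
dashes, which exist in Unicode but not in ISO-8859-1.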

--
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
_______________________________________________
Users mailing list
Users(at)not-real.lists.swish-e.org
http://lists.swish-e.org/listinfo/users



Received on Thu Mar 15 2012 - 02:01:25 GMT