Problems with libxml2 when using swish-e http spidering

From: Ken-Yu Lin <kengyuli(at)>
Date: Mon Jun 02 2003 - 16:36:56 GMT
I have and swish-e-2.2.3 (built with libxml2) installed
under my student account on a school machine
(Sun-Blade-1000 with OS 5.8). I have prepared a config file for
testing with a small set of URLs, as follows:

#This is the config file number test for swish-e


IndexFile /home1/k/ke/ken/kenyulin/bin/test.index

IndexName "Online Searching Services for AEC Product Procurements"

IndexDescription "This is an index to test a small prototype"

IndexPointer ""

IndexAdmin "Ken-Yu Lin"

IndexReport 3

UseStemming no

IgnoreTotalWordCountWhenRanking no


IndexComments 0

MaxDepth 2

Delay 20

DefaultContents HTML2

TmpDir /home1/k/ke/ken/kenyulin/temp/

SpiderDirectory /home1/k/ke/ken/kenyulin/bin/

However, whenever I tried indexing via the http method, swish-e
always ended with "err: No unique words indexed!".

------------------------------indexing result-----------------------------
Indexing Data Source: "HTTP-Crawler"
Indexing ""
retrieving (0)...
Indexing ""
retrieving (0)...
Indexing ""
retrieving (0)...
Indexing ""
retrieving (0)...

Removing very common words...
no words removed.
Writing main index...
err: No unique words indexed!

But when I tried the same config file on another
machine, where the libxml2 libraries were installed by someone else, it
worked (this machine runs
and Linux). There I could see on the screen that the HTML2 parser was
used and how many words were indexed:

- Using HTML2 parser - (324 words)
Skipping Too deep.
Skipping Too deep.
Skipping Too deep.
Skipping Wrong method or server.
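
To compare the two runs without reading the screen output by eye, a short script can summarize a saved copy of the indexer output. This is only a sketch: it assumes the per-document lines look exactly like the "Using HTML2 parser - (324 words)" line above and that a failed run contains the "err: No unique words indexed!" message. The function name is made up for illustration.

```python
import re

# Matches per-document lines such as "- Using HTML2 parser - (324 words)".
# The format is taken from the output quoted above; other versions may differ.
PARSER_LINE = re.compile(r"Using (\S+) parser\s*-\s*\((\d+) words\)")


def summarize_index_log(text):
    """Return (words per parser, number of parsed documents, saw the error)."""
    words_by_parser = {}
    docs = 0
    for match in PARSER_LINE.finditer(text):
        parser, words = match.group(1), int(match.group(2))
        words_by_parser[parser] = words_by_parser.get(parser, 0) + words
        docs += 1
    failed = "err: No unique words indexed!" in text
    return words_by_parser, docs, failed
```

If the summary for the failing machine shows zero parsed documents, then no document reached a parser at all, which is a different symptom from a parser running and finding no words.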

Has anyone experienced this before? Why is this happening? It looks like,
on the first machine,
the HTML2 parser is not found (even though I have specified it in the
config file and installed the needed libxml2
libraries). But when I installed swish-e, I did see that libxml2 was
connected. Weird ~ ~ ~
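
One way to test the "parser not found" guess is to ask the dynamic linker whether the installed binary can actually resolve libxml2 at run time. The sketch below runs `ldd` (available on both Solaris and Linux) and classifies its output. It is only one possible check, not a confirmed diagnosis: the function names are hypothetical, and a result of "not listed" can also mean the library was linked statically.

```python
import subprocess


def libxml2_status(ldd_output):
    """Classify ldd output as 'resolved', 'unresolved', or 'not listed'."""
    for line in ldd_output.splitlines():
        if "libxml2" in line:
            # ldd reports missing libraries with the words "not found".
            return "unresolved" if "not found" in line else "resolved"
    return "not listed"


def check_binary(path):
    """Run ldd on the binary at `path` and classify the result."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True)
    return libxml2_status(result.stdout)
```

Calling `check_binary` with the full path to the swish-e executable on each machine would show whether the two installations differ in how libxml2 is resolved.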

Any help would be much appreciated.

Thank you!

Ken-Yu Lin.
Received on Mon Jun 2 16:37:08 2003