Problems with libxml2 when using swish-e http spidering

From: Ken-Yu Lin <kengyuli(at)not-real.ms21.hinet.net>
Date: Mon Jun 02 2003 - 16:36:56 GMT
I have libxml2.so.2.5.7 and swish-e-2.2.3 (built with libxml2) installed
under my student account on a school machine
(Sun-Blade-1000 with OS 5.8). I have prepared a config file for
testing with a small set of URLs, as follows:

-----------------------------test.config---------------------------------
#This is the config file number test for swish-e


IndexDir http://www.lexcotile.com/products.htm
IndexDir http://www.epicurious.com/g_gourmet/g03_qanda/tips.html
IndexDir http://www.infotile.com/asa/html/powdered.html
IndexDir http://www.infotile.com/beaumont/tips/adhesives.html


IndexFile /home1/k/ke/ken/kenyulin/bin/test.index

IndexName "Online Searching Services for AEC Product Procurements"

IndexDescription "This is an index to test a small prototype"

IndexPointer "http://ckdd.cee.uiuc.edu/research/index.html"

IndexAdmin "Ken-Yu Lin"

IndexReport 3

UseStemming no

IgnoreTotalWordCountWhenRanking no

WordCharacters abcdefghijklmnopqrstuvwxyz\&#;0123456789.@|,-'"[](~!@$%^{}_+?a'e'i'o'u'u"n~A'E'I'O'U'U"N~??

IndexComments 0

MaxDepth 2

Delay 20

DefaultContents HTML2

TmpDir /home1/k/ke/ken/kenyulin/temp/

SpiderDirectory /home1/k/ke/ken/kenyulin/bin/
-----------------------------------------------------------------------------
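
One thing that can be checked on both machines is whether the path-valued directives in the config point at something usable. The following is a minimal sketch, not part of swish-e: the function names `read_directives` and `check_paths` are hypothetical, and the helper file name `swishspider` inside `SpiderDirectory` is an assumption that does not appear in this post.

```python
import os


def read_directives(config_text):
    """Collect (name, value) pairs, skipping blank lines and # comments."""
    pairs = []
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first space only: directive name, then the rest.
        name, _, value = line.partition(" ")
        pairs.append((name, value.strip()))
    return pairs


def check_paths(pairs):
    """Return a list of problems found for TmpDir and SpiderDirectory."""
    problems = []
    for name, value in pairs:
        if name in ("TmpDir", "SpiderDirectory") and not os.path.isdir(value):
            problems.append(f"{name}: directory not found: {value}")
        if name == "SpiderDirectory":
            # Assumption: the spider helper is a file named "swishspider".
            spider = os.path.join(value, "swishspider")
            if not os.access(spider, os.X_OK):
                problems.append(f"swishspider not executable: {spider}")
    return problems
```

Running `check_paths(read_directives(open("test.config").read()))` on each machine returns an empty list when both directories exist and the helper is executable. It says nothing about libxml2 itself.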


However, whenever I try indexing via the http method with swish-e,
it always ends with "err: No unique words indexed!".

------------------------------indexing result-----------------------------
Indexing Data Source: "HTTP-Crawler"
Indexing "http://www.lexcotile.com/products.htm"
retrieving http://www.lexcotile.com/products.htm (0)...
Indexing "http://www.epicurious.com/g_gourmet/g03_qanda/tips.html"
retrieving http://www.epicurious.com/g_gourmet/g03_qanda/tips.html (0)...
Indexing "http://www.infotile.com/asa/html/powdered.html"
retrieving http://www.infotile.com/asa/html/powdered.html (0)...
Indexing "http://www.infotile.com/beaumont/tips/adhesives.html"
retrieving http://www.infotile.com/beaumont/tips/adhesives.html (0)...

Removing very common words...
no words removed.
Writing main index...
err: No unique words indexed!
.
-------------------------------------------------------------------------------


But when I tried the same config file on another
machine, where the libxml2 libraries were installed by someone else, it
worked (this machine runs Linux with libxml2.so.2.5.7).
There I could see on screen that the HTML2 parser was
used and how many words were indexed.

--------------------------------------------------------
- Using HTML2 parser - (324 words)
Skipping http://www.infotile.com/beaumont/index.htm: Too deep.
Skipping http://www.infotile.com/beaumont/index.htm: Too deep.
Skipping http://www.infotile.com/beaumont/locations/index.htm: Too deep.
Skipping http://www.southportceramics.com.au/: Wrong method or server.
...............
--------------------------------------------------------
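
The visible difference between the two runs is whether a parser line with a word count follows each retrieval. A minimal sketch that scans captured output and reports this per URL is given here. The function name `summarize` is hypothetical, the patterns are inferred only from the two samples in this post, and the parser line is assumed to belong to the most recent "retrieving" line.

```python
import re

# Patterns inferred from the output samples in this post (assumptions).
RETRIEVING = re.compile(r"^retrieving (\S+) \(\d+\)\.\.\.")
PARSED = re.compile(r"- Using (\w+) parser -\s+\((\d+) words\)")


def summarize(log_text):
    """Map each retrieved URL to (parser, word_count), or None if no
    parser line was seen for it."""
    results = {}
    current = None
    for line in log_text.splitlines():
        m = RETRIEVING.match(line.strip())
        if m:
            current = m.group(1)
            results[current] = None
        # The parser report may share the line with "retrieving" or follow it.
        p = PARSED.search(line)
        if p and current is not None:
            results[current] = (p.group(1), int(p.group(2)))
    return results
```

On output like the failing run, every URL maps to `None`, which shows that no document reached a parser at all. That alone does not say whether the fetch or the parser is at fault.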


Has anyone experienced this before? Why is this happening? It looks like
the HTML2 parser is not found on the first machine
(even though I have specified it in the
config file and installed the needed libxml2
libraries). But when I installed swish-e, I did see that libxml2 was
linked in. Weird ~ ~ ~

Any help would be much appreciated.

Thank you!


Ken-Yu Lin.
Received on Mon Jun 2 16:37:08 2003