Properties not being indexed with libxml2 enabled

From: Brent DeShazer <brent_deshazer(at)>
Date: Thu Jan 22 2004 - 19:29:59 GMT
I have two separate web servers with identical information (PDF and HTML
documents) that I am trying to index and subsequently view through

The first server is a Mandrake 7.2 server WITHOUT libxml2 installed. It is
an internal server and was set up with swish 2.4.1 first, and everything
worked great. The indexes were correctly created and a .prop file showed a
reasonable size (20MB), and search-results listings showed the first
couple-hundred characters of the files (swishdescription). My conf file is:

IndexDir      /var/www/html/swish/
IndexFile     /var/www/html/swish/dc-opinions.index
UseStemming   yes
MetaNames     swishtitle swishdocpath
ReplaceRules remove /var/www
IndexContents HTML .pdf
IndexContents HTML .html
StoreDescription HTML* <body> 200000

and the file is:

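#!/usr/bin/perl -w
use pdf2html;

my @files =
`find /var/www/opinions -name '*' -print`;
for my $path (@files) {
        chomp $path;    # drop the newline that find appends to each path
        if ($path =~ /pdf$/) {
                # pdf2html returns a reference to a complete record
                my $html_record_ref = pdf2html($path);
                print $$html_record_ref;
        } elsif ($path =~ /html$/) {
                # fields 7 and 9 of stat() are size in bytes and mtime
                my ($size, $mtime) = (stat($path))[7, 9];
                print "Content-Length: $size\n";
                print "Last-Mtime: $mtime\n";
                print "Path-name: $path\n";
                print "\n";
                open(HTMLFILE, "$path") || die "Error opening $path: $!";
                my $buffer;
                while (read(HTMLFILE, $buffer, 16384)) {  # copy the HTML through
                        print $buffer;
                }
                close(HTMLFILE);
        }
}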

I am creating the indexes using the following command:

"/usr/local/bin/swish-e -c /var/www/html/swish/dc-opinions.conf -S prog"
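
For anyone reproducing this, one thing worth ruling out is a malformed header in the stream the program hands to swish-e. With -S prog, each document is preceded by headers such as Path-Name and Content-Length, and Content-Length should carry a byte count. The sketch below is a crude check and not part of swish-e: bad_content_length is a hypothetical helper name, and it will also examine any body line that happens to start with "Content-Length:".

```shell
# Hypothetical helper: read a "-S prog" stream on stdin and print, with
# line numbers, every Content-Length header whose value is not a plain number.
bad_content_length() {
    # First grep keeps only the header lines (prefixed "N:" by -n);
    # second grep drops the ones that look like "N:Content-Length: <digits>".
    grep -n -i '^content-length:' \
        | grep -v -i '^[0-9]*:content-length: *[0-9][0-9]*[[:space:]]*$'
}
```

To use it, pipe the indexing program's output into bad_content_length. No output means every Content-Length header it saw carried a number.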

When I tried to duplicate this setup with the /exact same data/ on a
Mandrake 8.2 server that DID have libxml2 installed, the .prop file never
got properly populated (only a couple-hundred KB in size) and of course the
swishdescription was not displayed in the swish.cgi search results.
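
A quick way to compare the two servers is to print the property file size on each. This is only a sketch: prop_size is a hypothetical helper, and it assumes the property file sits next to the index under the same name with a .prop suffix.

```shell
# Hypothetical helper: print the size in bytes of the property file
# that belongs to the index file given as $1 (assumed to be "$1.prop").
prop_size() {
    # wc -c pads its output on some systems, so strip the spaces
    wc -c < "$1.prop" | tr -d ' '
}
```

Running prop_size /var/www/html/swish/dc-opinions.index on both machines should make the difference described above (roughly 20MB versus a couple-hundred KB) easy to confirm.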

My fix was to re-configure, re-compile and re-install swish on this server
using "./configure --without-libxml2". Now the output on this server matches
exactly that of our internal server.

So, is there something wrong with my config or program that I need to change
to use libxml2, or is this a feature/bug?

From the INSTALL file - "Libxml2 is very strongly recommended. It is used
for parsing both HTML and XML files. Swish-e can be built and installed
without libxml2, but the HTML parser built into swish-e is not as accurate
as libxml2" - so obviously I'd like to use libxml2 if possible.


Brent DeShazer
Manager of Systems Engineering
U.S. District Court, Kansas
Received on Thu Jan 22 19:31:07 2004