Properties not being indexed with libxml2 enabled

From: Brent DeShazer <brent_deshazer(at)not-real.ksd.uscourts.gov>
Date: Thu Jan 22 2004 - 19:29:59 GMT
I have two separate web servers with identical information (PDF and HTML
documents) that I am trying to index and subsequently view through
swish.cgi.

The first server is a Mandrake 7.2 server WITHOUT libxml2 installed. It is
an internal server and was set up with swish 2.4.1 first, and everything
worked great. The indexes were correctly created and a .prop file showed a
reasonable size (20MB), and search-results listings showed the first
couple-hundred characters of the files (swishdescription). My conf file is:

--------------------------------------------
IndexDir      /var/www/html/swish/dc-opinions-prog.pl
IndexFile     /var/www/html/swish/dc-opinions.index
UseStemming   yes
MetaNames     swishtitle swishdocpath
ReplaceRules remove /var/www
IndexContents HTML .pdf
IndexContents HTML .html
StoreDescription HTML* <body> 200000
--------------------------------------------

and the dc-opinions-prog.pl file is:

--------------------------------------------
#!/usr/bin/perl -w
use pdf2html;

my ($mtime,$size);
my @files =
`find /var/www/opinions -name '*' -print`;
for (@files) {
        chomp();
        if ($_ =~ /pdf$/) {
                my $html_record_ref = pdf2html($_);
                print $$html_record_ref;
        } elsif ($_ =~ /html$/) {
                $mtime=(stat())[9];
                $size=(stat())[7];
                print "Content-Length: $size\n";
                print "Last-Mtime: $mtime\n";
                print "Path-name: $_\n";
                print "\n";
                open(HTMLFILE, "$_") || die "Error opening $_";
                while (read(HTMLFILE, $buffer, 16384))  # Print the Poll HTML file
                {
                        print $buffer;
                }
                close(HTMLFILE);
        }
}
--------------------------------------------
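
For reference, the per-document record that the script above prints for each HTML file (three header lines, a blank line, then the file body) can be sketched as a standalone function. This is only an illustration of that layout: the helper name format_record and the sample path, timestamp, and content are hypothetical and are not part of the script above, and the size is computed from the in-memory string instead of from stat().

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): build one record in the
# same layout the script prints: headers, a blank line, then the document.
sub format_record {
    my ($path, $mtime, $content) = @_;

    # Assumes $content holds raw bytes, so length() gives the byte count.
    my $size = length $content;

    return "Content-Length: $size\n"
         . "Last-Mtime: $mtime\n"
         . "Path-name: $path\n"
         . "\n"
         . $content;
}

# Example with made-up values.
print format_record(
    '/var/www/opinions/example.html',
    1074799799,
    "<html><body>Example</body></html>\n",
);
```

Printing such a record to a terminal makes it easy to compare the declared Content-Length against the body that actually follows the blank line.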

I am creating the indexes using the following command:

"/usr/local/bin/swish-e -c /var/www/html/swish/dc-opinions.conf -S prog"

When I tried to duplicate this setup with the /exact same data/ on a
Mandrake 8.2 server that DID have libxml2 installed, the .prop file never
got properly populated (only a couple-hundred KB in size) and of course the
swishdescription was not displayed in the swish.cgi search results.

My fix was to re-configure, re-compile and re-install swish on this server
using "./configure --without-libxml2". Now the output on this server matches
exactly that of our internal server.

So, is there something wrong with my config or program that I need to change
to use libxml2, or is this a feature/bug?

From the INSTALL file - "Libxml2 is very strongly recommended. It is used
for parsing both HTML and XML files. Swish-e can be built and installed
without libxml2, but the HTML parser built into swish-e is not as accurate
as libxml2" - so obviously I'd like to use libxml2 if possible.

Thanks,

---
Brent DeShazer
Manager of Systems Engineering
U.S. District Court, Kansas
785.295.2574
Received on Thu Jan 22 19:31:07 2004