Parsing doc, xls and excel files with swish-e and libxml2

From: Animesh Bansriyar <animesh(at)not-real.arithme.net>
Date: Mon Jun 27 2005 - 19:05:16 GMT
Hi All,

I have been indexing doc and xls (Excel) files with swish-e using the
following compiled software: SWISH-E 2.4.3 with libxml2 2.6.11 on a Debian
Woody system. The files are parsed without any glitches, even after removing
the perl directories of swish-e and with no catdoc or wvware installed,
though pcre and zlib are present on the system.

The same thing happens on Fedora Core 2 with the same libxml2 version.

I am unable to figure out what is happening. Is libxml2 taking care of
the parsing of the files or ...

Notice in the output below that the HTML2 parser is being used. No filters
are being used either.

A sample output on Debian follows:

root@laptop:/tmp# /usr/local/swish-e/bin/swish-e -i /opt/work_data/Neolinux.doc -v 20
Indexing Data Source: "File-System"
Indexing "/opt/work_data/Neolinux.doc"

Checking file "/opt/work_data/Neolinux.doc"...
  Neolinux.doc - Using DEFAULT (HTML2) parser -  (290 words)

Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 193 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
193 unique words indexed.
4 properties sorted.
1 file indexed.  11,264 total bytes.  290 total words.
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!
root@laptop:/tmp#
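
One rough way to test whether the reported words could simply be readable
ASCII runs inside the binary file, rather than the result of real format
parsing, is to count such runs directly and compare them with the 290 total
and 193 unique words above. The sketch below is only an illustration under
stated assumptions: the definition of a "word" (two or more ASCII letters or
digits) and the function name count_ascii_words are hypothetical and do not
reproduce swish-e's own tokenizer, and text stored as 16-bit characters would
not be matched at all.

```python
import re
import sys

# Assumption: a "word" is a run of at least MIN_LEN ASCII letters or digits.
# This is NOT swish-e's tokenizer, only a rough stand-in for comparison.
MIN_LEN = 2
WORD_RE = re.compile(rb"[A-Za-z0-9]{%d,}" % MIN_LEN)


def count_ascii_words(data):
    """Return (total, unique) counts of word-like ASCII runs in raw bytes."""
    # Lowercase so that "Hello" and "hello" count as one unique word.
    words = [w.lower() for w in WORD_RE.findall(data)]
    return len(words), len(set(words))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Read the file as raw bytes; no document format parsing is attempted.
    with open(sys.argv[1], "rb") as fh:
        total, unique = count_ascii_words(fh.read())
    print(f"{total} word-like runs, {unique} unique")
```

Saved under a hypothetical name such as count_words.py, it could be run as
"python count_words.py /opt/work_data/Neolinux.doc". Counts close to the
swish-e figures would hint that plain text embedded in the file is being
picked up; very different counts would point elsewhere.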


But here is the problem: I have been trying to do exactly the same thing on
Windows but have not managed it so far. On Windows there was no libpcre, and
I had to use a lot of ugly hacks to get everything compiled properly under
MinGW and MSYS.

Could somebody try this out in an environment like mine and explain what is
going on? That is, a barebones install of swish-e with the latest set of
libxml2 libraries.

Thanks in Advance,
Regards,
Animesh
Received on Mon Jun 27 12:05:21 2005