Re: Funky, unknown errors

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Jul 02 2004 - 14:44:28 GMT
On Fri, Jul 02, 2004 at 05:25:27AM -0700, Alan Ivey wrote:
> I'm using the internal spider because I (unexperienced
> at Perl) found it easier to use FileFilters.

Actually the internal spider can also use the SWISH::Filter method for
filtering -- it's just not enabled by default because of the overhead of
loading all the SWISH::Filter modules for *every* document fetched.

But using FileFilter is fine if you don't mind the small overhead of
running extra processes for every document.

> # "DefaultContents should be used over IndexContents
> for using the internal spider"
> # - too bad it make the results page on swish.cgi look
> terrible! It shows the HTML tags!

I don't understand that comment.  If you index HTML as text then it
won't parse the html.  DefaultContents just sets the default document
type (which is used for things like StoreDescription) but swish-e would
still use the HTML parser by default.

> I wrote my own FileFilter. Like I said, I'm not a Perl
> expert by any means, so the only way I was able to get
> the results I wanted was to duplicate what I had done
> while testing with the command line. And since you
> can't do anything like "FileFilter .ppt command |
> command" (pipe), I wrote I simple bash script to do
> it. The perl script was written to remove the HTML
> tags from the ppthtml, since I really just want
> ppt-to-text. This is my humble attempt at it :)

All you should have to do is make sure it's parsed as HTML.

> retrieving
> http://iss2www.cdcsc.nmis.tk/ss/issapt/mio/7AStageONS1-4-01.doc
> (5)...
> retrieving
> http://iss2www.cdcsc.nmis.tk/ss/issapt/mio/Inc2postfltrpt_020504.doc
> (5)...
> Bad BBD entry!
> Segmentation fault

Looks like a problem with catdoc's reading of the file.  Google turns up
a few hits on the swish-e archive plus the code

   http://www.45.free.net/cgi-bin/cvsweb/catdoc/src/ole.c?cvsroot=soft&rev=1.14


                if (bbdSector >= fileLength/sectorSize) {
                        fprintf(stderr, "Bad BBD entry!\n");
                        ole_finish();
                        return NULL;
                }

First, you might wrap catdoc in a script that prints the file name to
stderr before and after running catdoc, so you know which file is
failing.  Then try running catdoc by itself on that file.  If catdoc
works fine by itself, then have your script copy the temporary file to
disk and compare it with the original.
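
A minimal sketch of such a wrapper, in Python rather than bash.  The
function name run_logged and the "filter start"/"filter end" message
format are made up for illustration; the only assumption about the
converter is that it takes the file path as its last argument and writes
the converted text to stdout.

```python
import subprocess
import sys


def run_logged(cmd, path, log=sys.stderr):
    """Run cmd + [path], naming the file on `log` before and after."""
    print(f"filter start: {path}", file=log, flush=True)
    # The converter's own stderr stays attached to ours, so any message it
    # prints (e.g. "Bad BBD entry!") lands between the two markers.
    proc = subprocess.run(cmd + [path], stdout=subprocess.PIPE)
    # A negative exit status means the converter was killed by a signal
    # (for example -11 for a segmentation fault on Linux).
    print(f"filter end: {path} (exit status {proc.returncode})",
          file=log, flush=True)
    return proc.returncode, proc.stdout


# Hypothetical use inside a filter script:
#   status, text = run_logged(["catdoc"], sys.argv[1])
#   sys.stdout.buffer.write(text)
```

Because the markers bracket each run, an error message with no file name
in it can still be matched to the file that caused it.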

You are not running on Windows, right?  Otherwise I'd check
if \n is being turned into \r\n before being passed to catdoc.
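
A rough sketch of how the original file and the saved temporary copy
could be compared byte for byte, including a check for the newline
translation case.  Both function names are invented for this example,
and the newline check only models the simple case where every \n has
been expanded to \r\n.

```python
def first_difference(a: bytes, b: bytes):
    """Return the offset of the first differing byte, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        # One input is a prefix of the other; they differ where it ends.
        return min(len(a), len(b))
    return None


def looks_newline_translated(original: bytes, copy: bytes) -> bool:
    """True if `copy` is `original` with every LF expanded to CR LF."""
    return original != copy and original.replace(b"\n", b"\r\n") == copy
```

Read both files in binary mode, e.g. open(path, "rb").read(), so that
Python itself does not translate any line endings before the comparison.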


> Another error I saw, which I'm just assuming was the
> FileFilter itself, NOT SWISH-E, but I'll post it
> anyway...
> 
> retrieving
> http://iss2www.cdcsc.nmis.tk/ss/issapt/mio/BackgroundCheckForm531.pdf
> (5)...
> Error (1064): Missing 'endstream'

Yes, that's an error from pdftotext.  Do the same test with pdftotext
outside of swish.  Again, be careful about using the file name printed
right before the error message (since catdoc and pdftotext don't print
the file name in their error message).

> Aside from the first one, this other one is the most
> alarming to me, especially since I had been using the
> -e option when I ran SWISH-E...
> 
> error : Memory allocation failed : growing buffer

Again, that's not swish.  That might be an error from libxml2.
In that case I'd probably look at the HTML source and see what's odd
about it -- maybe validate it, too.  The libxml2 package includes a few
support programs that might help (e.g. by letting you parse the file
outside of swish).  Look for the "xmllint" program (in
Debian it's in the libxml2-utils package) and use the --html switch if
needed.
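
A small sketch of running xmllint on a saved copy of the page from a
script.  The helper names are made up; --html and --noout are standard
xmllint switches (--noout suppresses the parsed output so only the
error messages are left).  It assumes the suspect document has already
been saved to a local file.

```python
import shutil
import subprocess


def xmllint_command(path, html=True):
    """Build an xmllint command line that parses `path` and prints only errors."""
    cmd = ["xmllint", "--noout"]
    if html:
        cmd.append("--html")  # use the HTML parser instead of the XML one
    cmd.append(path)
    return cmd


def check_with_xmllint(path, html=True):
    """Run xmllint if installed; return (exit status, stderr text) or None."""
    if shutil.which("xmllint") is None:
        return None
    proc = subprocess.run(xmllint_command(path, html),
                          stderr=subprocess.PIPE, text=True, errors="replace")
    return proc.returncode, proc.stderr
```

If xmllint reports the same "Memory allocation failed" message on the
file, that would point at the document and libxml2 rather than at swish.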

> The tmp folder is odd too. Of course there are
> hundreds of swtmploc files in there because the index
> was interrupted, but the first two files show...

Just remove them.

> I'm new to SWISH-E, and fairly new to Linux (been
> using extensively for about 3 months), which explains
> why I don't understand these errors. I tried the
> external spider.pl, but I couldn't figure out how to
> write a module to convert .ppt files. Also when I ran
> spider.pl, there were these weird, random characters
> that would display on the search results pages for
> doc files that looked like squares with 4 characters
> in them.

Didn't someone post a Powerpoint filter not too long ago?

The odd chars make me wonder about an encoding problem -- that shouldn't
happen, although if your LANG is set to UTF-8 then weird things have been
known to happen -- swish-e is still in the stone age of 8-bit only
characters.  The libxml2 parser outputs UTF-8, but swish-e is supposed to
convert that to 8859-1.  Regardless, you would just need to look at the
output and see what the character code is.
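
One way to look at the character codes, sketched in Python.  The
function names are invented for this example, and the encoding guess is
only a heuristic: bytes that decode cleanly as UTF-8 are very likely
UTF-8, while anything else with high-bit bytes is treated as some 8-bit
encoding such as 8859-1.

```python
def non_ascii_bytes(data: bytes):
    """Return (offset, byte value) for every byte outside 7-bit ASCII."""
    return [(i, b) for i, b in enumerate(data) if b > 0x7F]


def guess_encoding(data: bytes) -> str:
    """Rough guess: 'ascii', 'utf-8', or '8-bit'."""
    if not non_ascii_bytes(data):
        return "ascii"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "8-bit"


def report(data: bytes) -> str:
    """Format the non-ASCII bytes as 'offset: 0xNN' lines for inspection."""
    return "\n".join(f"{i}: 0x{b:02X}" for i, b in non_ascii_bytes(data))
```

Feeding it the raw bytes of a result page (or of the filter output)
shows whether the odd characters are multi-byte UTF-8 sequences that
were never converted, or single bytes in some other 8-bit encoding.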

-- 
Bill Moseley
moseley@hank.org

Unsubscribe from or help with the swish-e list: 
   http://swish-e.org/Discussion/

Help with Swish-e:
   http://swish-e.org/current/docs
   swish-e@sunsite.berkeley.edu
Received on Fri Jul 2 07:44:43 2004