From: Bill Moseley <moseley(at)>
Date: Fri May 28 2004 - 17:17:49 GMT
On Fri, May 28, 2004 at 09:46:11AM -0700, wrote:
> *Note* sorry for all the "undisclosed"s. I'm an intern with a gov't
> contracting agency so I don't know what all is allowed to be public.

We understand.  Hey, we are getting used to it.

> # swish-e -c undisclosed.conf -v 3 -S http
> FileFilter		.html "/bin/cat"   "'%p'"

Doesn't that qualify for a "useless use of cat" award?
Why are you using that?

> 1 file indexed.  1,133 total bytes.  9 total words.
> Elapsed time: 00:00:08 CPU time: 00:00:00
> Indexing done!

> - I don't know why that isn't working. Anyway, I switched to the
> method. I didn't edit at all, and here is my
> config file...

What's not working?  You mean it's not following any links?

> FileFilter 		.pdf pdftotext "'%p' -"
> FileFilter		.html "/bin/cat"   "'%p'"

You shouldn't need either of those.  In current versions of it
will automatically filter .pdf if you have pdftotext in your path.

> Error: May not be a PDF file (continuing anyway)
> Error (0): PDF file is damaged - attempting to reconstruct xref table...
> Error: Couldn't find trailer dictionary
> Error: Couldn't read xref table
> http://undisclosed/visions/ESE_Strategy2003.pdf - Using HTML2 parser -  (no words indexed)

It may be that you are trying to filter (with FileFilter) something that
has already been filtered.
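
One rough way to test that guess is to check whether the data being handed
to pdftotext still looks like a PDF. Real PDF files start with the five
bytes "%PDF-", so text that has already been converted will fail this
check. The function name looks_like_pdf is made up for this sketch, and it
assumes you have saved the suspect document to a local file first.

```shell
# Sketch only: looks_like_pdf is a hypothetical helper, not part of swish-e.
# Returns success (0) when the file starts with the PDF magic bytes "%PDF-".
looks_like_pdf() {
    # head -c 5 reads just the first five bytes of the file
    [ "$(head -c 5 "$1")" = "%PDF-" ]
}
```

Something like "looks_like_pdf saved.pdf && echo still a PDF" would then
show whether the file is still a PDF or has already been turned into text.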

> - I'm completely lost. I wish there were some sample configurations... I've been reading Docs all day and don't know what I'm doing wrong. It can't be permissions because I'm running as root. Please help.

Here's the secret:

Look at docs and see how to enable some of the debugging
features -- that will tell you what files are skipped and why.

Then run the spider outside of swish, something like:

   SPIDER_DEBUG=skipped /usr/local/swish-e/ default > out.txt

and then you can see what's skipped and why, and then you can look at
out.txt and see what your content looks like.
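
To get a quick overview of out.txt without paging through every document,
something along these lines may help. It assumes each document in the dump
begins with "Path-Name:" and "Content-Length:" header lines, as in
swish-e's prog input format. The function name list_docs is made up for
this sketch.

```shell
# Sketch only: list_docs is a hypothetical helper, not part of swish-e.
# Prints the Path-Name and Content-Length header lines from a spider dump.
list_docs() {
    # -a treats the dump as text even when it contains binary data (e.g. PDFs)
    grep -a -E '^(Path-Name|Content-Length):' "$1"
}
```

Running "list_docs out.txt" then lists which URLs were fetched and how
large each one was, which makes a file that only contains the start page
easy to spot.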

Bill Moseley
Received on Fri May 28 10:17:50 2004