Re: FW: PDF indexing suddenly stopped working

From: Bill Moseley <moseley(at)>
Date: Fri Dec 02 2005 - 18:35:34 GMT

On Fri, Dec 02, 2005 at 10:09:54AM -0800, Chad Day wrote:
> Sorry, should have provided more detail.. I was doing a swish-e presentation in 30 minutes and then this broke, hence the panicking.
> It doesn't hang or anything, it just skips the PDFs when indexing via HTTP.

Are you using the -S http method?  Don't use that.

I was going to say the -S http method doesn't filter by default, but it
looks like it does.  If you must use it, look at swishspider to see what
it's doing.

Why don't you use the spider with the -S prog method instead?  For example:

moseley@bumby:~$ cat spider.conf
@servers = ( {
    base_url => '',
    max_files => 4,
    use_default_config => 1,
    email => '',
} );

$ /usr/local/lib/swish-e/ spider.conf  > out
/usr/local/lib/swish-e/ Reading parameters from 'spider.conf'
/usr/local/lib/swish-e/ Max files Reached

Summary for:
     Connection: Close:      1  (1.0/sec)
Connection: Keep-Alive:      4  (4.0/sec)
            Duplicates:     71  (71.0/sec)
        Off-site links:     28  (28.0/sec)
           Total Bytes: 30,651  (30651.0/sec)
            Total Docs:      4  (4.0/sec)
           Unique URLs:      5  (5.0/sec)
             text/html:      4  (4.0/sec)

moseley@bumby:~$ swish-e -S prog -i stdin < out
Indexing Data Source: "External-Program"
Indexing "stdin"
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 312 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
312 unique words indexed.
4 properties sorted.
4 files indexed.  30,651 total bytes.  1,022 total words.
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!
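
If the PDFs still seem to be skipped, it may help to check whether the
spider fetched them at all before blaming the indexer.  The following is
only a rough sketch: it assumes each document in the spider's -S prog
output starts with a "Path-Name:" header and that the PDF URLs end in
".pdf".  The helper name list_pdf_paths is made up for illustration, and
"out" is the spider output file from the run above.

```shell
# List the Path-Name headers in a spider output file that point at PDFs.
# Assumptions: documents in the -S prog stream begin with a "Path-Name:"
# header, and PDF URLs end in ".pdf". list_pdf_paths is a hypothetical name.
list_pdf_paths() {
    # -a: treat the file as text even if it contains binary document bodies
    # -i: match both .pdf and .PDF
    grep -a -i '^Path-Name:.*\.pdf[[:space:]]*$' "$1"
}

# Only inspect the spider output if the file is actually present.
if [ -f out ]; then
    list_pdf_paths out || echo "no PDF paths found in out"
fi
```

If nothing is listed, the spider probably never fetched the PDFs (for
example because max_files was reached first), so the problem would be in
the spidering step rather than in the filtering or indexing.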

Bill Moseley

Received on Fri Dec 2 10:35:35 2005