On Fri, Dec 02, 2005 at 10:09:54AM -0800, Chad Day wrote:
> Sorry, should have provided more detail.. I was doing a swish-e presentation in 30 minutes and then this broke, hence the panicking.
>
> It doesn't hang or anything, it just skips the PDFs when indexing via HTTP.
Are you using the -S http method? Don't use that.
I was going to say that the -S http method doesn't filter documents by
default, but it looks like it does. If you must use it, look at
swishspider to see what it's doing.
Why don't you use spider.pl instead?
moseley@bumby:~$ cat spider.conf
@servers = ( {
    base_url           => 'http://swish-e.org',
    max_files          => 4,
    use_default_config => 1,
    email              => 'moseley@hank.org',
});
$ /usr/local/lib/swish-e/spider.pl spider.conf > out
/usr/local/lib/swish-e/spider.pl: Reading parameters from 'spider.conf'
/usr/local/lib/swish-e/spider.pl: Max files Reached
Summary for: http://swish-e.org
Connection: Close: 1 (1.0/sec)
Connection: Keep-Alive: 4 (4.0/sec)
Duplicates: 71 (71.0/sec)
Off-site links: 28 (28.0/sec)
Total Bytes: 30,651 (30651.0/sec)
Total Docs: 4 (4.0/sec)
Unique URLs: 5 (5.0/sec)
text/html: 4 (4.0/sec)
moseley@bumby:~$ swish-e -S prog -i stdin < out
Indexing Data Source: "External-Program"
Indexing "stdin"
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 312 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
312 unique words indexed.
4 properties sorted.
4 files indexed. 30,651 total bytes. 1,022 total words.
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!
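If PDFs still seem to be skipped, one rough way to narrow it down is to check whether they ever made it into the spider's output file. The sketch below assumes that each document in "out" starts with a Path-Name: header, as in the -S prog input format; the function names are made up for illustration and are not part of swish-e. A body line that happens to start with "Path-Name:" would be miscounted, so treat the result as a hint only.

```shell
#!/bin/sh
# Hypothetical helpers, not part of swish-e or spider.pl.

# Print the URL from every Path-Name: header in a spider output file.
# -a makes grep treat the file as text even if document bodies are binary.
list_spidered_paths() {
    grep -a '^Path-Name:' "$1" | sed 's/^Path-Name:[[:space:]]*//'
}

# Count how many of those URLs end in .pdf (case-insensitive).
count_spidered_pdfs() {
    list_spidered_paths "$1" | grep -ci '\.pdf$'
}
```

Running count_spidered_pdfs out after the spider finishes gives a count of zero if the spider never fetched a PDF, which points at the spider config rather than at the indexer. A non-zero count suggests looking at how the PDF content is being filtered instead.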
--
Bill Moseley
moseley@hank.org
Received on Fri Dec 2 10:35:35 2005