
Re: swish-e on a large scale

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Sep 30 2004 - 18:47:07 GMT
On Thu, Sep 30, 2004 at 11:15:53AM -0700, Aaron Levitt wrote:
> If you have a suggestion on how to get spider.pl to only index
> certain dirs, I would appreciate it.  This way I can also run
> separate indexes.

Yes, I know it's long (and I'm trying to thin it out right now), but:

  http://swish-e.org/current/docs/spider.html

has the answers.  Look at test_url.
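For example, a minimal spider.pl config that only follows URLs under
certain directories might look like this (a sketch -- the host and
paths are made up):

```perl
# SwishSpiderConfig.pl -- sketch only; host and paths are hypothetical.
@servers = (
    {
        base_url => 'http://www.example.com/archive/',
        email    => 'admin@example.com',
        # test_url is called with a URI object before each request;
        # return false to skip that URL entirely.
        test_url => sub {
            my $uri = shift;
            return $uri->path =~ m{^/archive/(list-a|list-b)/};
        },
    },
);
1;
```

Run one spider/index pass per directory set and you get your separate
indexes.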

> How do I use the new btree database back-end?  Would this be better 
> than the default?

You build swish with --enable-incremental.

Likely, in your case.  I don't know how much it's been tested.  Report
back your findings.
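If you're building from source, that means passing the flag to
configure (a sketch; the directory name is an assumption):

```shell
# Rebuild swish-e with the incremental btree back-end.
cd swish-e-2.4.x        # your source tree
./configure --enable-incremental
make
make install
```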

> >In general, you can spider separately from indexing.  Just capture the
> >output from the spider to a file and when done pipe that into swish.
> 
> This might be an alternative; can you give me a little more detail on 
> the best way to do this?

  spider.pl | gzip > output.gz
  gzip -dc output.gz | swish-e -S prog -c foo -i stdin

> Unfortunately, I can't do this, due to production environment 
> limitations.  Also, for the sake of protecting email addresses, it has 
> to be done via http rather than directly from the file system.

Not sure what your production environment limitations are, but:

You can hide the email addresses with a substitution, e.g.:

  s/(\w+)@\w+\.\w+/$1\@whitehouse.gov/g;

> So, at this point I am going to move swish to a faster box with about 
> double the RAM, and I will rebuild swish with that option.  Should I 
> continue to use the database the way it is, or should I try the btree 
> database back-end?  I have also been considering the mysql route.

1) get a gzip of everything.
2) index and see how long it takes
3) build swish with --enable-incremental
4) index and see how long it takes (and tell the list, too)
5) add a file to the index and see how long it takes.
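The steps above might look like this in practice (a sketch: host
names, file names, and the config file are all assumptions, and the
`-u` update flag is an assumption too -- verify it against your
build's `swish-e -h` output):

```shell
# 1) Spider once and keep a gzip of everything.
spider.pl default http://www.example.com/ | gzip > spidered.gz

# 2) Time a full index build from the saved output.
time gzip -dc spidered.gz | swish-e -S prog -c swish.conf -i stdin

# 3)-4) Rebuild swish with --enable-incremental, then repeat step 2
#       and compare the timings.

# 5) Time adding one file to the existing incremental index
#    (-u assumed to be the update flag; check swish-e -h).
time swish-e -c swish.conf -u -i new-file.html
```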

There are so many ways to do this that you will likely have to try
different options and see what works best.



> 475,944 unique words indexed.
> 5 properties sorted.
> 637,449 files indexed.  2,932,324,538 total bytes.  231,714,672 total 
> words.
> Elapsed time: 47:57:02 CPU time: 04:30:30
                _______            _______

Either your disks are really slow or you were out of memory.  Or both.

-- 
Bill Moseley
moseley@hank.org

Unsubscribe from or help with the swish-e list: 
   http://swish-e.org/Discussion/

Help with Swish-e:
   http://swish-e.org/current/docs
   swish-e@sunsite.berkeley.edu
Received on Thu Sep 30 11:47:20 2004