Slow indexing speed

From: Juan Salvador Castejón <juans.castejon(at)not-real.gmail.com>
Date: Fri Jun 03 2005 - 11:44:02 GMT
Hi,

I'm indexing a web site using spider.pl. Indexing was quite fast at the
beginning, but it has slowed down significantly as the process has gone
on.

The web site is on our intranet, so access time to the web pages is very
short. The indexing process has taken three days to index 100,000
pages and has not finished yet. The documents are a mixture of HTML,
PDF, DOC and RTF files, mostly HTML and PDF. I'm using
catdoc, xpdf and unrtf as filter programs.
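
To put a number on that, here is a rough back-of-the-envelope calculation
in Perl. It assumes exactly three days of wall-clock time and exactly
100,000 pages. Both are round figures, so the result is only approximate.

```perl
use strict;
use warnings;

# Round figures from the description above; both are approximate.
my $pages   = 100_000;
my $seconds = 3 * 24 * 60 * 60;    # three days of wall-clock time

my $pages_per_second = $pages / $seconds;
my $seconds_per_page = $seconds / $pages;

printf "%.2f pages/s, %.2f s per page\n", $pages_per_second, $seconds_per_page;
```

That works out to roughly 0.4 pages per second, or about 2.6 seconds per
page on average.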

I think this indexing speed is slow. Watching the swish-e process, I
see that it is waiting for IO (D status) almost all the time, while
spider.pl is sleeping. I don't know whether this behaviour is normal or
a symptom of some kind of problem.

This is top's output for the spider.pl and swish-e processes:

 13:32:23  up 46 days,  4:35,  2 users,  load average: 1,00, 1,00, 1,00
2 processes: 2 sleeping, 0 running, 0 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total    0,0%    0,0%    0,0%   0,4%     0,0%   99,5%    0,0%
           cpu00    0,0%    0,0%    0,0%   0,9%     0,0%   99,0%    0,0%
           cpu01    0,0%    0,0%    0,0%   0,0%     0,0%  100,0%    0,0%
Mem:   503872k av,  491040k used,   12832k free,       0k shrd,    3280k buff
                    374028k actv,   88740k in_d,    2400k in_c
Swap: 2096472k av, 1048440k used, 1048032k free                    5076k cached

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
28007 root      15   0  636M 397M   540 D     0,0 80,7  67:48   1 swish-e
28008 root      16   0  327M  28M   968 S     0,0  5,8  62:41   1 spider.pl
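
As a rough check of the memory situation, the sketch below compares the
figures from the top output above. The numbers are copied from that
output. Treating top's M suffix as 1024 KB, and reading SIZE minus RSS as
memory that is not resident in RAM, are my assumptions and not something
I have confirmed.

```perl
use strict;
use warnings;

# Figures copied from the top output above (in KB).
my $ram_total_kb  = 503_872;       # Mem: av
my $swap_used_kb  = 1_048_440;     # Swap: used
my $swish_size_kb = 636 * 1024;    # swish-e SIZE, assuming M = 1024 KB
my $swish_rss_kb  = 397 * 1024;    # swish-e RSS, assuming M = 1024 KB

# Share of physical RAM occupied by swish-e's resident pages.
my $rss_share = 100 * $swish_rss_kb / $ram_total_kb;

# Part of swish-e's size that is not resident (assumed: swapped or untouched).
my $not_resident_kb = $swish_size_kb - $swish_rss_kb;

printf "swish-e RSS is %.1f%% of RAM\n", $rss_share;
printf "swish-e SIZE exceeds total RAM: %s\n",
    $swish_size_kb > $ram_total_kb ? "yes" : "no";
printf "swish-e size not resident: %d KB\n", $not_resident_kb;
printf "swap in use: %d KB\n", $swap_used_kb;
```

The RSS share agrees with the 80,7 in the %MEM column, and the total size
of swish-e is larger than the physical memory of the machine. That might
be related to the time spent waiting for IO, although I cannot say so for
sure.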

Here are my spider.pl settings:

my %carm = (
        use_default_config => 1,
        delay_sec          => 0,
        max_wait_time      => 10,
        max_size           => 0,
        use_cookies        => 1,
        use_md5            => 1,
        debug              => DEBUG_URL | DEBUG_SKIPPED | DEBUG_FAILED,
        base_url           => [qw!
                http://www.carm.es/ceh/
                http://www.carm.es/cpres/
                http://www.carm.es/educacion/
                http://www.carm.es/cagric/
                http://www.carm.es/csan/
                http://www.carm.es/ctra/
                http://www.carm.es/ceii/
                http://www.carm.es/op/
                http://www.carm.es/ctyc/
                http://www.carm.es/sgppg/
        !],
        email              => 'juans.castejon@carm.es',
        link_tags          => [qw/ a frame area /],
        keep_alive         => 1,
        use_head_requests  => 1,
        test_response      => sub {
                my ($uri, $server, $response) = @_;
                # Do not extract links from office documents and PDFs.
                $server->{no_spider} = $uri->path =~
                    /\.(pdf|PDF|doc|DOC|xls|XLS|rtf|RTF|ppt|PPT)$/;
                # Skip the contents of media files, archives and images.
                # Both checks are combined so that the second one does
                # not overwrite the result of the first.
                $server->{no_contents} =
                       $uri->path =~ /\.(mp3|avi|wma|jpg|gif|zip|bat|bmp|dot|eps|mdb|png|pps|psd|swf|tiff|wmf|wmv|tif|dwg|exe)$/
                    || $response->content_type =~ m[^image/];
                return 1;
        },
        test_url           => sub {
                # Dots are escaped so that they only match a literal dot.
                $_[0]->as_string =~ m{^(http://)?www\.carm\.es/};
        }, # I think this is unnecessary
);
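
To illustrate what the extension patterns in test_response are meant to
do, here is a small stand-alone sketch that applies the same extension
lists to a few paths. The sample paths and the helper name classify_path
are made up for the example, and it leaves out the content-type check
because there is no HTTP response to inspect here.

```perl
use strict;
use warnings;

# Same extension lists as in the spider.pl settings above.
my $no_spider_re   = qr/\.(pdf|PDF|doc|DOC|xls|XLS|rtf|RTF|ppt|PPT)$/;
my $no_contents_re = qr/\.(mp3|avi|wma|jpg|gif|zip|bat|bmp|dot|eps|mdb|png|pps|psd|swf|tiff|wmf|wmv|tif|dwg|exe)$/;

# Hypothetical helper: returns the flags the patterns would give for a path.
sub classify_path {
    my ($path) = @_;
    my %flags = (
        no_spider   => ($path =~ $no_spider_re)   ? 1 : 0,
        no_contents => ($path =~ $no_contents_re) ? 1 : 0,
    );
    return \%flags;
}

# Made-up sample paths, only for illustration.
for my $path (qw( /ceh/index.html /ceh/informe.pdf /ceh/logo.gif /ceh/LOGO.GIF )) {
    my $flags = classify_path($path);
    printf "%-18s no_spider=%d no_contents=%d\n",
        $path, $flags->{no_spider}, $flags->{no_contents};
}
```

Note that the second list is lower-case only, so a path such as
/ceh/LOGO.GIF is not caught by it.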

And the swish-e settings:

IndexDir /usr/local/lib/swish-e/spider.pl
SwishProgParameters  /root/buscador/spider.conf
StoreDescription HTML2 <body> 2500
StoreDescription TXT2 2500
PropertyNameAlias swishdescription description
DefaultContents HTML2
IndexContents HTML2 .htm .html .shtml .xhtml .jsp
IndexContents TXT2  .txt .log .text .pdf .doc .rtf .xls .ppt
IndexContents XML2  .xml
# To ignore accents
TranslateCharacters :ascii7:
IgnoreTotalWordCountWhenRanking no


Is it possible to speed up indexing? I would very much appreciate any
ideas.

Thank you in advance.
Regards,
Juan Salvador Castejón Garrido
Received on Fri Jun 3 04:44:03 2005