Re: Max depth error??

From: Juan Salvador Castejón <juans.castejon(at)not-real.gmail.com>
Date: Tue May 24 2005 - 08:34:34 GMT
Oops, I forgot the attachment. Sorry.

2005/5/24, Juan Salvador Castejón <juans.castejon@gmail.com>:
> Hi,
>
> I'm indexing a web site using spider.pl on a Windows XP machine. The
> problem is that swish-e does not index files whose depth is >= 5. The
> spider crawls all the pages correctly and supplies all of them to
> swish-e, but swish-e completely ignores every page whose depth is >= 5.
>
> These are my configuration files:
>
> >> spider.conf
>
> my %carm = (
>         use_default_config => 1,
>         max_depth       => 10,
>         delay_sec       => 0,
>         max_size        => 0,
>         use_cookies     => 1,
>         debug           => DEBUG_URL | DEBUG_FAILED | DEBUG_SKIPPED | DEBUG_ERRORS |
>                            DEBUG_INFO | DEBUG_LINKS | DEBUG_HEADERS,
>         base_url        => 'http://www.carm.es/ceh/',
>         email           => 'juans.castejon@carm.es',
>         link_tags       => [qw/ a frame area /],
>         keep_alive      => 1,
>         test_response   => sub {
>                                 my $server = $_[1];
>                                 $server->{no_spider} = $_[0]->path =~
>                                     /.*\.(pdf|PDF|doc|DOC|xls|XLS|rtf|RTF|ppt|PPT)$/;
>                                 $server->{no_contents} = $_[0]->path =~
>                                     /.*\.(mp3|avi|wma|jpg|gif|zip|bat|bmp|dot|eps|mdb|png|pps|psd|swf|tiff|wmf|wmv|tif|dwg|exe)$/;
>                                 $server->{no_contents} = $_[2]->content_type =~ m[^image/];
>                                 return 1;
>                                },
>         test_url        => sub {
>                                 $_[0]->as_string =~ /^(http:\/\/)?www.carm.es\/ceh\/(.)*/;
>                                },
> );
>
> @servers = (\%carm);
> 1;
>
> >> swish.conf
>
> IndexDir perl.exe
>
> SwishProgParameters lib\swish-e\spider.pl lib\swish-e\spider.conf
>
> StoreDescription HTML2 <body> 2500
>
> StoreDescription TXT2 2500
>
> StoreDescription HTML <body> 2500
>
> StoreDescription TXT 2500
>
> PropertyNameAlias swishdescription description
>
> DefaultContents HTML2
>
> IndexContents HTML* .htm .html .shtml .xhtml .jsp
>
> IndexContents TXT*  .txt .log .text .pdf .doc .rtf
>
> IndexContents XML*  .xml
>
> TranslateCharacters :ascii7:
>
> ParserWarnLevel 9
>
> IgnoreTotalWordCountWhenRanking no
>
> I run swish-e as:
>
> swish-e.exe -S prog -c swish.conf -f index5.swish -v 3 -R 1
>
> Attached to this message are the output logs generated by the indexing
> process. They show the spider finding certain pages (all of them with
> depth >= 5) that swish-e ignores (for example, URLs containing
> 'codmenu=91').
>
> Is it a bug, or am I missing something? Any help would be greatly
> appreciated. Thank you in advance.
>
> Juan Salvador Castejón
>



*********************************************************************
Due to deletion of content types excluded from this list by policy,
this multipart message was reduced to a single part, and from there
to a plain text message.
*********************************************************************
Received on Tue May 24 01:34:34 2005