Oops, I forgot the attachment. Sorry.
2005/5/24, Juan Salvador Castejón <juans.castejon@gmail.com>:
> Hi,
>
> I'm indexing a web site using spider.pl on a Windows XP machine. The
> problem is that swish-e does not index files whose depth is >= 5. The
> spider crawls all the pages correctly and supplies all of them to
> swish-e, but swish-e completely ignores every page whose depth is >= 5.
>
> These are my configuration files:
>
> >> spider.conf
>
> my %carm = (
>     use_default_config => 1,
>     max_depth => 10,
>     delay_sec => 0,
>     max_size => 0,
>     use_cookies => 1,
>     debug => DEBUG_URL | DEBUG_FAILED | DEBUG_SKIPPED | DEBUG_ERRORS |
>         DEBUG_INFO | DEBUG_LINKS | DEBUG_HEADERS,
>     base_url => 'http://www.carm.es/ceh/',
>     email => 'juans.castejon@carm.es',
>     link_tags => [qw/ a frame area /],
>     keep_alive => 1,
>     test_response => sub {
>         my $server = $_[1];
>         $server->{no_spider} = $_[0]->path =~
>             /.*\.(pdf|PDF|doc|DOC|xls|XLS|rtf|RTF|ppt|PPT)$/;
>         $server->{no_contents} = $_[0]->path =~
>             /.*\.(mp3|avi|wma|jpg|gif|zip|bat|bmp|dot|eps|mdb|png|pps|psd|swf|tiff|wmf|wmv|tif|dwg|exe)$/;
>         $server->{no_contents} = $_[2]->content_type =~ m[^image/];
>         return 1;
>     },
>     test_url => sub {
>         $_[0]->as_string =~ /^(http:\/\/)?www.carm.es\/ceh\/(.)*/;
>     },
> );
>
> @servers = (\%carm);
> 1;
>
> >> swish.conf
>
> IndexDir perl.exe
> SwishProgParameters lib\swish-e\spider.pl lib\swish-e\spider.conf
> StoreDescription HTML2 <body> 2500
> StoreDescription TXT2 2500
> StoreDescription HTML <body> 2500
> StoreDescription TXT 2500
> PropertyNameAlias swishdescription description
> DefaultContents HTML2
> IndexContents HTML* .htm .html .shtml .xhtml .jsp
> IndexContents TXT* .txt .log .text .pdf .doc .rtf
> IndexContents XML* .xml
> TranslateCharacters :ascii7:
> ParserWarnLevel 9
> IgnoreTotalWordCountWhenRanking no
>
> I run swish-e as:
>
> swish-e.exe -S prog -c swish.conf -f index5.swish -v 3 -R 1
>
> Attached to this message are the output logs generated by the indexing
> process. In them you can see that the spider finds certain pages (all
> of them with depth >= 5) that swish-e ignores, e.g. URLs containing
> 'codmenu=91'.
>
> Is it a bug, or am I missing something? Any help would be greatly
> appreciated. Thank you in advance.
>
> Juan Salvador Castejón
>
*********************************************************************
Due to deletion of content types excluded from this list by policy,
this multipart message was reduced to a single part, and from there
to a plain text message.
*********************************************************************
Received on Tue May 24 01:34:34 2005