I've reinstalled swish-e (version 2.4.5) yet again and still see the same
effect (or defect):
Summary for: <SOME_URL>
Connection: Close: 3 (0.0/sec)
Connection: Keep-Alive: 224 (1.2/sec)
Duplicates: 60 (0.3/sec)
Off-site links: 14 (0.1/sec)
Total Bytes: 74,442 (402.4/sec)
Total Docs: 226 (1.2/sec)
Unique URLs: 227 (1.2/sec)
text/html: 1 (0.0/sec)
All files are reported by spider.pl as duplicates. Note that this time
I also tried a third-party site. Any suggestions?
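
One way to narrow this down might be to save the spider output to a file
(the spider.pl command quoted below, redirected to a file instead of piped
into swish-e) and look at what each document record declares before
swish-e sees it. The following is only a minimal sketch. It assumes the
stream follows the `-S prog` layout (header lines such as Path-Name and
Content-Length, a blank line, then exactly Content-Length bytes of body).
The file name spider.out and the function name read_prog_stream are
hypothetical, and whether a zero Content-Length is what actually triggers
the "has no content" warning is a guess to be verified.

```python
import sys


def read_prog_stream(stream):
    """Yield (path, declared_length, bytes_read) for each record.

    Assumed layout (swish-e -S prog style): header lines, a blank line,
    then Content-Length bytes of body, immediately followed by the next
    record's headers.
    """
    while True:
        line = stream.readline()
        # Tolerate stray blank lines between records.
        while line in (b"\n", b"\r\n"):
            line = stream.readline()
        if not line:
            return  # end of stream

        headers = {}
        while line and line.strip():
            name, _, value = line.partition(b":")
            headers[name.strip().lower()] = value.strip()
            line = stream.readline()

        declared = int(headers.get(b"content-length", b"0"))
        body = stream.read(declared)
        path = headers.get(b"path-name", b"?").decode("utf-8", "replace")
        yield path, declared, len(body)


# Hypothetical usage: python check_prog.py spider.out
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as fh:
        for path, declared, got in read_prog_stream(fh):
            if declared == 0:
                status = "EMPTY"
            elif got < declared:
                status = "SHORT"
            else:
                status = "ok"
            print(f"{status}\t{declared}\t{path}")
```

If most records show up as EMPTY even though the server log reports
thousands of bytes sent, the content is probably being lost inside the
spider (for example during decoding or filtering) rather than in swish-e.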
On Feb 3, 2008 9:31 PM, Alexander Dolgarev <a.dolgarev@gmail.com> wrote:
> I've also noticed that for each subsequent URL the spider writes:
>
> Summary for: http://XXX/msg00298.html
> Connection: Close: 1 (1.0/sec)
> Connection: Keep-Alive: 4 (4.0/sec)
> Duplicates: 8 (8.0/sec)
> Off-site links: 2 (2.0/sec)
> Total Bytes: 7,950 (7950.0/sec)
> Total Docs: 5 (5.0/sec)
> Unique URLs: 5 (5.0/sec)
> text/html: 1 (1.0/sec)
>
> Summary for: http://XXX/maillist.html
> Duplicates: 1 (1.0/sec)
>
> Summary for: http://XXX/msg00297.html
> Duplicates: 1 (1.0/sec)
>
> but these URLs are not duplicates. Where is the problem?
>
>
> On Feb 3, 2008 8:53 PM, Alexander Dolgarev <a.dolgarev@gmail.com> wrote:
> > I have a problem with spider.pl. When I run
> > /usr/local/lib/swish-e/spider.pl default <SOME_URL> | swish-e -c
> > swish.conf -S prog -i stdin -f test
> > I get a lot of messages like the following:
> > Warning: document 'XXX' has no content
> > When I look at the created index file I see that only the document
> > <SOME_URL> was indexed; ALL other URLs (the ones linked from this
> > document) were not indexed. The log files on the HTTP server show that
> > spider.pl retrieves the URLs and receives responses, e.g.:
> > [03/Feb/2008:18:46:43 +0100] <XXX> GET /XXX HTTP/1.1 "200" 14758
> > "swish-e http://swish-e.org/" "-" 18
> > That means that 14758 bytes were sent to spider.pl for URL <XXX>,
> > but spider.pl says: Warning: document 'XXX' has no content
> >
>
_______________________________________________
Users mailing list
Users@lists.swish-e.org
http://lists.swish-e.org/listinfo/users
Received on Mon Feb 4 08:16:27 2008