I made some progress and was able to reduce the duplication. It seems the
root of the issue is in swish.config, where I had specified
SwishProgParameters default
I decided to take out the default one and use only spiderconfig.pl, and
everything seems to work fine.
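For anyone following along, the change amounts to something like the sketch
below. This is an illustration rather than my exact file: the relative paths
to spider.pl and spiderconfig.pl are assumptions, and the commented-out line
is the one Peter quoted from my earlier config.

```
# swish.config (sketch; paths are assumptions)
IndexDir spider.pl

# Before: this told spider.pl to use its built-in "default" settings
# SwishProgParameters default http://www.domainname.com/index.html

# After: pass the spider config file as the only parameter
SwishProgParameters spiderconfig.pl
```

With this in place, the start URL has to come from spiderconfig.pl itself,
since it is no longer given on the SwishProgParameters line.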
The only problem now is that the rank does not seem correct for some results.
The rank can be 477 for 5 rows and then 366 for another 7 rows.
Also, sometimes we cannot find the word we searched for, but it still appears
in the results.
On Fri, Sep 18, 2009 at 8:07 PM, Peter Karman <email@example.com> wrote:
> Ronny Rahardjo wrote on 9/18/09 5:48 PM:
> > Hi Peter,
> > Please ignore my question no.1. I was able to figure out which spider.pl
> > it is called. However, could you please let me know how can I check
> > whether my spider.pl is using spiderconfig.pl. I found spiderconfig.pl
> > in the same folder as swish.config, but I don't see any reference in the
> > spider.pl.
> try putting a:
> die "yes, you are using me!";
> statement at the top of spiderconfig.pl and then run the spider.pl.
> However, this line in the config you posted here:
> SwishProgParameters default http://www.domainname.com/index.html
> suggests that you are using the default config, not your spiderconfig.pl file.
> > And secondly, how can I exclude "a href=#tab" link in spider.pl
> I think spider.pl will ignore a link like '#tab' since that's just a
> self-referential link. Example:
> [karpet@pekmac:~/Sites]$ SPIDER_DEBUG=url,links spider.pl default
> /Users/karpet/bin/spider.pl: Reading parameters from 'default'
> -- Starting to spider: http://localhost/~karpet/tab.html --
> >> +Fetched 0 Cnt: 1 GET http://localhost/~karpet/tab.html 200 OK
> 141 parent: depth:0
> Extracting links from http://localhost/~karpet/tab.html:
> Looking at extracted tag '<a href="#tab">'
> tag did not include any links to follow or is a duplicate
> Path-Name: http://localhost/~karpet/tab.html
> Content-Length: 141
> Last-Mtime: 1253329219
> Document-Type: html*
> <title>test doc</title>
> foo bar <a href="#tab">nothing to see here</a> and more here
> Summary for: http://localhost/~karpet/tab.html
> Connection: Close: 1 (1.0/sec)
> Duplicates: 1 (1.0/sec)
> Total Bytes: 141 (141.0/sec)
> Total Docs: 1 (1.0/sec)
> Unique URLs: 1 (1.0/sec)
> text/html: 1 (1.0/sec)
> So I think you need to run spider.pl with your config against a test
> and see what kind of output you get. Turn on the debugging options like I
> suggested. Ultimately, you're the only one who is going to discover the
> solution to your problem. I'm just suggesting approaches to try.
> Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
> Users mailing list
Received on Thu Oct 22 20:19:13 2009