Re: [swish-e] Using ExtractPath to Exclude Some Subdirectory from Search Result

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Sat Sep 19 2009 - 03:07:23 GMT

Ronny Rahardjo wrote on 9/18/09 5:48 PM:
> Hi Peter,
>  
> Please ignore my question no.1. I was able to figure out which spider.pl
> it is called. However, could you please let me know how can I check
> whether my spider.pl is using spiderconfig.pl. I found spiderconfig.pl
> in the same folder as swish.config, but I don't see any reference in the
> spider.pl.

Try putting a

 die "yes, you are using me!";

statement at the top of spiderconfig.pl and then run spider.pl. If the run
aborts with that message, the file is being loaded; if spider.pl runs
normally, it is not.

However, this line in the config you posted here:

SwishProgParameters default http://www.domainname.com/index.html

suggests that you are using the built-in default config, not your
spiderconfig.pl file: the word 'default' sits where spider.pl would otherwise
take the name of a config file.
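
For illustration, here is a rough sketch of what pointing swish-e at your own
spider config might look like. The file contents, the email address, and the
IndexDir value are assumptions, not your actual setup, so adjust them to match
your installation. The commented-out die line is the same temporary check
described above.

```perl
# spiderconfig.pl -- hypothetical minimal config for spider.pl (a sketch only)
#
# For swish-e to hand this file to spider.pl, swish.config would name it
# where the word "default" currently appears, along these lines:
#
#   IndexDir spider.pl
#   SwishProgParameters spiderconfig.pl
#
# Temporary check: uncomment the next line to prove this file is being read.
# die "yes, you are using me!";

# spider.pl reads the list of sites to crawl from @servers.
@servers = (
    {
        base_url => 'http://www.domainname.com/index.html',  # start URL
        email    => 'you@example.com',  # placeholder contact address (assumed)
    },
);

1;  # the file must end with a true value so that spider.pl accepts it
```

With a file like this in use, the "Reading parameters from" line that spider.pl
prints at startup should name spiderconfig.pl rather than 'default'.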

>  
> And secondly, how can I exclude "a href=#tab" link in spider.pl

I think spider.pl will ignore a link like '#tab', since that's just a
self-referential link (a fragment pointing back into the same page). Example:

[karpet@pekmac:~/Sites]$ SPIDER_DEBUG=url,links spider.pl default http://localhost/~karpet/tab.html
/Users/karpet/bin/spider.pl: Reading parameters from 'default'

 -- Starting to spider: http://localhost/~karpet/tab.html --
>> +Fetched 0 Cnt: 1 GET  http://localhost/~karpet/tab.html  200 OK text/html 141 parent: depth:0

Extracting links from http://localhost/~karpet/tab.html:

Looking at extracted tag '<a href="#tab">'
  tag did not include any links to follow or is a duplicate
Path-Name: http://localhost/~karpet/tab.html
Content-Length: 141
Last-Mtime: 1253329219
Document-Type: html*

<html>
 <head>
  <title>test doc</title>
 </head>
 <body>

  foo bar <a href="#tab">nothing to see here</a> and more here

 </body>
</html>


Summary for: http://localhost/~karpet/tab.html
Connection: Close:   1  (1.0/sec)
       Duplicates:   1  (1.0/sec)
      Total Bytes: 141  (141.0/sec)
       Total Docs:   1  (1.0/sec)
      Unique URLs:   1  (1.0/sec)
        text/html:   1  (1.0/sec)

So I think you need to run spider.pl with your config against a test document
and see what kind of output you get. Turn on the debugging options as I did
above (SPIDER_DEBUG=url,links). Ultimately, you're the only one who is going
to discover the answer to your problem. I'm just suggesting approaches to try.

-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Fri Sep 18 23:07:23 2009