Re: [swish-e] Swish-e not indexing doc or PDF files

From: Liam Buchanan <Liam.Buchanan(at)not-real.dtrdi.qld.gov.au>
Date: Thu Feb 21 2008 - 01:50:15 GMT
Hi,
Thanks for the information.
I tried to do a trace but it didn't come up with anything unusual.

Below is my spider.pl config file.
Please let me know if there is anything in there that I am missing or that
should be taken out. The proxy reference needs to be in there for it to work.

Thanks!

---------------


#remove SWISH::FILTER


my ($filter_sub, $response_sub) = swish_filter();


my %main_site = (
		 

		 base_url  => 'http://intranet.sd.qld.gov.au/dsdweb/v4/apps/web/content.cfm?id=46',
		 #base_url  => 'http://intranet.sd.qld.gov.au/dsdweb/v4/apps/web/content.cfm?id=97',
		 same_hosts    => '!172.16.100.246! http://intranet.sd.qld.gov.au/!',
		 agent       => 'swish-e spider http://swish-e.org/',

		 email     => 'thomas.nguyen@sd.qld.gov.au',
		 debug       => DEBUG_URL | DEBUG_SKIPPED | DEBUG_HEADERS,
		 keep_alive  => 1,         # Try to keep the connection open
		 filter_content  => $filter_sub,  # use SWISH filter
		 max_depth => 2,
		 delay_sec => 0,
		 max_indexed => 100000,
		 max_time =>9000,
		 use_md5 => 1,
		 test_url  => sub { 
		     my ($uri, $server) = @_; 
		     # enable proxy requests
		     
		     unless ($::proxy_set++) {
			 my $ua = $server->{ua};
			 $ua->proxy('http', 'http://localhost:5865');

		     }

		     
		     # return true if not an image, otherwise false
		     return $uri->path !~ /\.(gif|jpeg|png|jpg|dat|log|exe)$/;

		     
		 },

		 );



@servers = ( \%main_site);


# You can also set LWP::UserAgent to read the proxy data from the
# environment. See perldoc LWP::UserAgent for details.
 
use_default_config => 1,

-------------

EOF 
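
The comment in the config about reading proxy data from the environment could be tried outside the spider first. The following is only a minimal sketch, assuming the same proxy address as in the config above (`http://localhost:5865`) and that it is exported as `http_proxy`; normally that variable would be set in the shell rather than in the script.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Assumption: the proxy is the one from the config above. In practice
# this would be exported in the shell before running spider.pl.
$ENV{http_proxy} = 'http://localhost:5865';

my $ua = LWP::UserAgent->new;

# Load proxy settings from the *_proxy environment variables.
$ua->env_proxy;

# With a single argument, proxy() returns the proxy set for that scheme.
my $http_proxy = $ua->proxy('http');
print "http proxy: $http_proxy\n";
```

Inside spider.conf the equivalent would be calling `env_proxy` on `$server->{ua}` in place of the explicit `proxy()` call, which is also untested here.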

-----Original Message-----
From: users-bounces@lists.swish-e.org
[mailto:users-bounces@lists.swish-e.org] On Behalf Of Peter Karman
Sent: Wednesday, 13 February 2008 12:30 PM
To: Swish-e Users Discussion List
Subject: Re: [swish-e] Swish-e not indexing doc or PDF files



Liam Buchanan wrote on 2/12/08 5:43 PM:
> Hi,
> 
> I am using spider.pl to crawl. I have only 1 pdf on the entire
> intranet as a test. I have tried both the domain and ip in the
> hyperlink.
> I did some extensive testing yesterday. The strange thing is if I use
> pdftotext or pdftohtml directly on a local file then it generates the
> output correctly.
> It seems to have a big problem opening the pdf after running swish-e.
> this same pdf can be opened directly from a browser (as a binary file)
> and as stated before it opens when directly applying pdftotext and
> pdftohtml in cmd.
> Heres the pdftohtml error:
>
>  (523 words)
> http://*****.au/dsdweb/v4/apps/web/secure/docs/25.pdf - Using HTML2 parser - Error: Couldn't open file ''http://*****.au/dsdweb/v4/apps/web/secure/docs/25.pdf''
>  (no words indexed)
>
> Also I am not sure how to turn on the -T debugging - can you assist me
> with this.
> Verbose is active.
> 

Here's a brief example of how I test:

% cat spider.conf
my ($filter_sub, $response_sub) = swish_filter();

@servers = ({
     skip        => 0,         # Flag to disable spidering this host.

     base_url    => 'http://peknet.com/~karpet/swish-e_documentation.pdf',

     agent       => 'swish-e spider http://swish-e.org/',
     email       => 'swish@domain.invalid',

     # This will generate A LOT of debugging information to STDOUT
     debug       => DEBUG_URL | DEBUG_SKIPPED | DEBUG_HEADERS,


     # Here are hooks to callback routines to validate urls and responses
     # Probably a good idea to use them so you don't try to index
     # Binary data.  Look at content-type headers!

     test_url        => \&test_url,
     test_response   => $response_sub,
     filter_content  => $filter_sub,
} );

sub test_url {
  my ( $uri, $server ) = @_;
  return 1;  # Ok to index/spider
}

1;



Now run spider.pl:

% spider.pl spider.conf | swish-e -S prog -i stdin

spider.pl: Reading parameters from 'spider.conf'

  -- Starting to spider: http://peknet.com/~karpet/swish-e_documentation.pdf --
Indexing Data Source: "External-Program"
Indexing "stdin"

vvvvvvvvvvvvvvvv HEADERS for http://peknet.com/~karpet/swish-e_documentation.pdf vvvvvvvvvvvvvvvvvvvvv

---- Request ------
GET http://peknet.com/~karpet/swish-e_documentation.pdf
Accept-Encoding: gzip; deflate
From: swish@domain.invalid
User-Agent: swish-e spider http://swish-e.org/


---- Response ---
Status: 200 OK
Connection: close
Date: Wed, 13 Feb 2008 02:28:19 GMT
Accept-Ranges: bytes
ETag: "cecd0-c6835-266e1740"
Server: Apache/2.0.54 (Fedora)
Content-Length: 813109
Content-Type: application/pdf
Last-Modified: Thu, 01 Dec 2005 19:06:29 GMT
Client-Date: Wed, 13 Feb 2008 02:28:24 GMT
Client-Peer: 209.98.116.241:80
Client-Response-Num: 1

^^^^^^^^^^^^^^^ END HEADERS ^^^^^^^^^^^^^^^^^^^^^^^^^^

 >> +Fetched 0 Cnt: 1 GET http://peknet.com/~karpet/swish-e_documentation.pdf 200 OK application/pdf 813109 parent: depth:0

Summary for: http://peknet.com/~karpet/swish-e_documentation.pdf
          Connection: Close:       1  (0.1/sec)
                Total Bytes: 486,739  (54082.1/sec)
                 Total Docs:       1  (0.1/sec)
                Unique URLs:       1  (0.1/sec)
application/pdf->text/html:       1  (0.1/sec)
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 3,835 words alphabetically
Writing header ...
Writing index entries ...
   Writing word text: Complete
   Writing word hash: Complete
   Writing word data: Complete
3,835 unique words indexed.
4 properties sorted.
1 file indexed.  486,739 total bytes.  79,924 total words.
Elapsed time: 00:00:12 CPU time: 00:00:03 Indexing done!
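
Applied to the intranet config from earlier in the thread, the same pattern might look like the sketch below. It is untested against that site and may or may not resolve the PDF problem. The one structural difference from the posted config is that `test_response => $response_sub` is passed alongside `filter_content`, as in the test config above. Writing `same_hosts` as an array ref of host names is an assumption about what spider.pl expects, and the email address is a placeholder.

```perl
# spider.conf -- sketch only; URL and proxy are copied from the posted config.
my ( $filter_sub, $response_sub ) = swish_filter();

@servers = ({
    base_url   => 'http://intranet.sd.qld.gov.au/dsdweb/v4/apps/web/content.cfm?id=46',
    same_hosts => ['172.16.100.246'],    # assumption: array ref of host names
    agent      => 'swish-e spider http://swish-e.org/',
    email      => 'swish@domain.invalid',    # placeholder address
    debug      => DEBUG_URL | DEBUG_SKIPPED | DEBUG_HEADERS,
    max_depth  => 2,

    test_url => sub {
        my ( $uri, $server ) = @_;

        # Route requests through the local proxy; set it only once per run.
        $server->{ua}->proxy( 'http', 'http://localhost:5865' )
            unless $::proxy_set++;

        # Skip images and other files that are not worth fetching.
        return $uri->path !~ /\.(gif|jpeg|png|jpg|dat|log|exe)$/;
    },

    # Both callbacks returned by swish_filter() are passed in.
    test_response  => $response_sub,
    filter_content => $filter_sub,
});

1;
```

If the spider summary then shows a line such as `application/pdf->text/html`, as in the output above, the PDF was filtered before it reached swish-e.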



--
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
_______________________________________________
Users mailing list
Users@lists.swish-e.org
http://lists.swish-e.org/listinfo/users

Received on Wed Feb 20 21:00:19 2008