Re: index pdf files with spider.pl

From: <moseley(at)not-real.hank.org>
Date: Wed Jul 23 2003 - 16:27:00 GMT
On Wed, Jul 23, 2003 at 08:25:16AM -0700, Erik Lyons wrote:
> 
> PDF transformed: 1  (1.0/sec)
>         Skipped: 1  (1.0/sec)
>     Unique URLs: 1  (1.0/sec)
> 
> # file test.html
> test.html: empty

Sorry, I'm about to leave for the day so I can't help one step at a 
time.

So it looks like the PDF was transformed but then skipped.  perldoc 
spider.pl should explain how to turn on the debugging flags that show 
why a document is being skipped.  Hopefully, that will make it clear 
why you are not getting the results you are expecting.

You should have all the tools you need to debug -- enable debugging in 
spider.pl to see why you are not getting output.  Once that's fixed, 
you can pipe the output file into swish-e and use its -T debugging 
options to verify what's actually being indexed.

If that doesn't work, post a URL of the PDF in question and your 
spider.pl config file and I'll take a look tomorrow.
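
The "look at the output file" step can be scripted. This is only a 
minimal sketch: check_spider_output is a hypothetical helper (not part 
of swish-e), and it assumes a converted document carries an HTML 
<title> element. The SPIDER_DEBUG variable named in the comment is my 
recollection of the spider.pl docs, so confirm the exact names with 
perldoc spider.pl before relying on it.

```shell
#!/bin/sh
# Produce the file to check first, e.g.:
#   spider.pl your_SPIDER_config_file.name > test.html
# Assumption: debugging can be enabled via the environment, e.g.
#   SPIDER_DEBUG=skipped spider.pl your_SPIDER_config_file.name > test.html
# (verify the variable and option names in perldoc spider.pl).

# Hypothetical helper: prints "empty", "no-title" or "ok" for a spider output file.
check_spider_output() {
    file=$1
    # A zero-length (or missing) file means the spider emitted no documents,
    # for example because everything was skipped.
    if [ ! -s "$file" ]; then
        echo "empty"
        return 1
    fi
    # Assumption: a usable converted document contains a <title> element.
    if ! grep -qi '<title>' "$file"; then
        echo "no-title"
        return 2
    fi
    echo "ok"
}
```

Run it as "check_spider_output test.html". A result of "empty" matches 
the "test.html: empty" output quoted above and points back at the 
spider's skip logic rather than at swish-e.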

> 
> >>> <moseley@hank.org> 07/23/03 07:58AM >>>
> On Wed, Jul 23, 2003 at 07:54:51AM -0700, Erik Lyons wrote:
> > Thanks Bill,
> > 
> > Run this way, spider.pl appears to expect perl, so given the "f.conf"
> > example (list of directives) it fails in a bountiful blossom of syntax
> > errors.
> 
> Right, sorry I wasn't clear:
> 
> >    spider.pl your_config_file.name > test.html
> 
> should be:
> 
>      spider.pl your_SPIDER_config_file.name > test.html
> 
> 
> 
> > 
> > >>> Bill Moseley <moseley@hank.org> 07/22/03 07:07PM >>>
> > On Tue, Jul 22, 2003 at 04:38:13PM -0700, Erik Lyons wrote:
> > > After several weeks of exclaiming joyful praise to the initial "S" in
> > > SWISH, I stumbled across the example quoted below. It runs and reports
> > > "PDF transformed:      2,009  (19.7/sec)", but no PDF files can be
> > > returned in any search results. As an added bonus, all document titles
> > > that are in the search results appear as "(NULL)". Are these problems
> > > related, or do I have 2 different gleaming horizons of delight to
> > > explore?
> > 
> > Hard to say, but probably not hard to debug.
> > 
> > Edit the spider's config file to point to a single PDF file.  Then just
> > run the spider like:
> > 
> >    spider.pl your_config_file.name > test.html
> > 
> > and look at test.html and make sure it has a title and content.
> > 
> > Then you can index that one PDF with:
> > 
> >    cat test.html | swish-e -c your_config -S prog -i stdin -T properties
> > 
> > the -T properties will show you if the title is being stored.
> > 
> > 
> > 
> > 
> > -- 
> > Bill Moseley
> > moseley@hank.org 
> > 
> 
> -- 
> Bill Moseley
> moseley@hank.org 
> 

-- 
Bill Moseley
moseley@hank.org
Received on Wed Jul 23 16:27:20 2003