
Re: Indexing PDFs on Windows - Revisited....

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Sep 23 2004 - 03:15:03 GMT
On Wed, Sep 22, 2004 at 03:21:46PM -0700, Anthony Baratta wrote:
> 
> I've tried two different approaches to this issue:

Not sure what the issue is yet, but I'll read on.  Let me know if any
of this doesn't make sense.  The short answer, I think, is to not use
a spider.pl config file at first, and let it use the default config
file.  One confusing thing about the spider is that there's a default
config file built into spider.pl that is only used when the word
"default" is used in the spider parameters:

   perl spider.pl default http://swish-e.org/

But if you use your own config file then that default config is not
used and you have to make sure you set up everything in your own
config -- like filtering and so on.  I've been meaning to change that
behavior so that the defaults are always used and your own config just
overrides the defaults.
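Just to show the shape of it, a spider config file is a little chunk
of perl that sets @servers.  A bare-bones one looks something like
this (a sketch from memory -- and note it gives you *none* of the
built-in defaults):

   # minimal sketch of a spider config -- not the built-in defaults
   @servers = (
       {
           base_url => 'http://local.dev.port.com/',
           email    => 'you@example.com',   # spider.pl wants a contact address
           # no test_url/test_response/filter_content here, so the spider
           # will fetch and hand swish-e everything it finds
       },
   );
   1;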

A few comments/suggestions:

Even under Windows you should be able to use forward slashes in the
config file.  That makes things easier to read than all those
backslashes.

Also, you might find it puts less load on the web server to use
keep_alive instead of a one-second delay.  And it makes indexing
faster, too.
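In the spider config that's just the following (assuming I've got the
option names right -- older versions of spider.pl spell the delay
option differently, so check perldoc spider.pl):

   keep_alive => 1,   # reuse one connection instead of reconnecting every time
   delay_sec  => 0,   # and drop the one second pause between requests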


> The output for this option is a bit strange....while it attempts to index 
> the site, it fails to record word count for pages after the 16th link. This 
> link is a PDF and the spider appears to lockup on analyzing it and while it 
> fetches all the other links it finds, it fails to index these pages.

I'm not really sure what you are seeing there -- I'm spidering right
now and there are a few PDFs that do take a bit of CPU to grind
through.

Oh, wait, you are using a spider config file.  If you do that then the
spider will attempt to index *everything* it finds on the web server.
The spider config file is powerful -- but when you use one you need to
deal with a lot of things yourself, like setting up filtering and
limiting what the spider fetches (e.g. just text/* content).
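A test_response callback is the usual place to do that limiting --
roughly like this (I'm going from memory on the callback arguments,
the examples in SwishSpiderConfig.pl show the exact signature):

   test_response => sub {
       my ( $uri, $server, $response ) = @_;
       # only pass text/html, text/plain, etc. on to swish-e
       return $response->content_type =~ m!^text/!;
   },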

>  >> +Fetched 1 Cnt: 17 
> http://local.dev.port.com/newsroom/pressrel/pressrel_162.asp
> 	200 OK text/html 20249
> 	parent:http://local.dev.port.com
> 	! Found 0 links in 
> http://local.dev.port.com/newsroom/pressrel/pressrel_162.asp
> 		sleeping 1 seconds


> As you can see from #17 on, there are no word counts. At the end of #16 
> there is this weirdness: "- Using HTML2 parser -  (59 words) sleeping 1 
> seconds". Seems like commands are stepping on eachother.

Well, spider.pl is writing messages to stderr and swish-e is writing
to stdout -- and there's buffering so I'd expect them to be mixed.
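If that bothers you, you can split the two streams when you run the
combined command -- something like this, with swish.conf standing in
for whatever config you are actually using:

   swish-e -c swish.conf -S prog > swish.out 2> spider.err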

> Then from #17 on - 
> it fails to index the fetched pages. The summary of work looks like this:
> 
> 	Summary for: http://local.dev.port.com
> 	       Total Docs:         968  (0.9/sec)


> 	16 files indexed.  435,308 total bytes.  5,753 total words.

Hum, that's odd.

Well, you can separate the process of indexing and spidering.  That
might make things easier to figure out.  You can run:

   perl /path/to/spider.pl default http://local.dev.port.com > spider.out

or 

   perl /path/to/spider.pl spider.config > spider.out

and you can then look at spider.out in an editor.  I suspect you would
want to use a spider.config file for setting keep_alive and maybe for
limiting how many files are fetched (under unix you can send a signal
to the spider to make it stop).  But, if you do use a spider config
file you have to control exactly what's spidered.
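For keeping a test run small, something like this in the spider config
should do it (option names from memory, so check perldoc spider.pl):

   max_files => 100,   # stop after fetching this many documents
   max_time  => 5,     # ...or after this many minutes, whichever comes first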

Then you can index like this:

   swish-e -c config -S prog -i stdin < spider.out


Now this FileFilter command doesn't work under Windows, I think.
You can't use single quotes on the Windows shell, right?

> 	FileFilter .pdf pdftotext "'%p' -"

Maybe use:

   FileFilter .pdf pdftotext '"%p" -'

but, I wouldn't filter that way, I'd filter in spider.pl.  The
SwishSpiderConfig.pl example file shows how to do that -- you need to
include a filter_content option that points to a subroutine that
processes documents with SWISH::Filter.
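Going from memory, it looks roughly like this -- the copy of
SwishSpiderConfig.pl that ships with swish-e has the working version,
so trust that over my recollection of the method names:

   use SWISH::Filter;
   my $filter = SWISH::Filter->new;

   # ... and in the server hash:
   filter_content => sub {
       my ( $uri, $server, $response, $content_ref ) = @_;

       my $doc = $filter->convert(
           document     => $content_ref,
           content_type => $response->content_type,
           name         => $response->base,
       );
       return 1 unless $doc && $doc->was_filtered;   # nothing converted, index as-is
       return if $doc->is_binary;                    # couldn't convert, skip it

       $$content_ref = ${ $doc->fetch_doc };              # swap in the filtered text
       $response->content_type( $doc->content_type );     # e.g. PDF becomes text/html
       return 1;
   },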

>   - Error: Couldn't open file
>     ''C:\Progra~1\SWISH-E\indexes\Tmp\swishspider@1108.contents''

Ya, I think it has something to do with those extra quotes.

[...]

> Common to both examples, the StoreDescription does not appear to be acted 
> on. I have no descriptions available via <swishdescription>, I get some 
> Date Time String (e.g " Local Time : 1:12:01 PM PT") instead. Nor does 
> swish appear to accept the IndexOnly / IndexContents directive - it 
> attempts to index the PDF anyway. It grabs the file then errors on "invalid 
> mime type". Is this correct behaviour? I would think that swish would skip 
> the file because of the .pdf extension not being the in the approved list.

Not unless you tell it to skip PDFs.  Open up spider.pl and search for
"default_url" and you will find the config the spider uses when you do
not specify one -- that is when you run spider with the word "default"
as the first parameter (ya, that's dumb, but it's because spider.pl
originally always took a file name as its first parameter).
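If you do want the spider itself to skip the PDFs, a test_url callback
in the spider config is the place -- something like this (again, the
real examples are in SwishSpiderConfig.pl):

   test_url => sub {
       my $uri = shift;                   # a URI object
       return $uri->path !~ /\.pdf$/i;    # false return = don't fetch it
   },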

IndexOnly won't work with -S prog (i.e. running spider.pl) --
IndexOnly is part of the -S fs input module.  Swish-e assumes that if
you are running a program that fetches files, then that program will
only send files that should be indexed.
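For the record, IndexOnly is just a list of file suffixes in the
swish-e config, and only the -S fs (file system) method looks at it:

   # only used by the -S fs (file system) method
   IndexOnly .htm .html .txt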


> This URL is a replica of the live site and should respond exactly the same.

Ok, so I used this command:

   $ SPIDER_DEBUG=skipped /usr/local/lib/swish-e/spider.pl \
         default http://test.portofoakland.com > pport

and it's been running while I write this message, so it has not
indexed that many pages yet.  I've seen a lot of errors from pdftotext
like:

   Error (4158918): Missing 'endstream'

You might need to ask the author of pdftotext what that means.  It's
also skipping a lot of PDFs because they are > 5MB (which is the
default max size for fetched docs).
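If you want those bigger ones indexed too, you can raise the limit in
the spider config -- something like:

   max_size => 10_000_000,   # bytes; I believe the default is 5,000,000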

Ok, so I just sent SIGHUP to the spider to stop it:

    Summary for: http://test.portofoakland.com
         Connection: Close:         9  (0.0/sec)
    Connection: Keep-Alive:       198  (0.1/sec)
                Duplicates:     5,367  (2.4/sec)
            Off-site links:       954  (0.4/sec)
                   Skipped:         8  (0.0/sec)
               Total Bytes: 4,885,055  (2188.6/sec)
                Total Docs:       198  (0.1/sec)
               Unique URLs:       208  (0.1/sec)

Then indexing:

    $ swish-e -e -S prog -i stdin < pport 
    Indexing Data Source: "External-Program"
    Indexing "stdin"
    Removing very common words...
    no words removed.
    Writing main index...
    Sorting words ...
    Sorting 14,886 words alphabetically
    Writing header ...
    Writing index entries ...
      Writing word text: Complete
      Writing word hash: Complete
      Writing word data: Complete
    14,886 unique words indexed.
    4 properties sorted.                                              
    198 files indexed.  4,885,055 total bytes.  282,012 total words.
    Elapsed time: 00:00:04 CPU time: 00:00:03
    Indexing done!

Oh, you were asking about storing the descriptions:

    $ cat c
    DefaultContents HTML*
    StoreDescription HTML* <body> 50

    $ swish-e -e -S prog -i stdin -c c -v0 < pport 

    $ swish-e -w port -m1 -p swishdescription -H0
    1000 http://test.portofoakland.com/pdf/boar_shee_040622.pdf "boar_shee_040622.pdf" 124467 "C JOHN PROTOPAPPAS President PATRICIA A. SCATES Fi"

Not sure where that first "C" (before John) comes from, but that's a
separate issue.  Those are the 50 chars stored in the description.

If you are going to index a lot of PDF files I'd suggest caching them
as compressed text -- it would take a bit of programming, but it would
make reindexing faster if those docs don't change often.
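Something along these lines is what I have in mind -- the cache
directory and get_filtered_text() are made up, it's just to show the
shape of it:

   use Digest::MD5    qw(md5_hex);
   use Compress::Zlib qw(memGzip memGunzip);

   my $cache_dir = '/path/to/filter-cache';   # made-up location

   sub cached_filtered_text {
       my ($content_ref) = @_;

       # key the cache on the raw document, so changed docs get refiltered
       my $cache_file = "$cache_dir/" . md5_hex($$content_ref) . '.gz';

       if ( -e $cache_file ) {                # reuse the last run's conversion
           open my $fh, '<', $cache_file or die "$cache_file: $!";
           binmode $fh;
           local $/;
           return memGunzip( scalar <$fh> );
       }

       # get_filtered_text() stands in for the pdftotext/SWISH::Filter step
       my $text = get_filtered_text($content_ref);

       open my $fh, '>', $cache_file or die "$cache_file: $!";
       binmode $fh;
       print {$fh} memGzip($text);
       close $fh;

       return $text;
   }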


Yes, it's time to update spider.pl.

-- 
Bill Moseley
moseley@hank.org
