Re: error indexing pdf files

From: Bill Moseley <moseley(at)>
Date: Tue Apr 15 2003 - 15:13:43 GMT
On Tue, 15 Apr 2003, Jody Cleveland wrote:

> Here's a chunk of what the output looks like:
> >> +Fetched 10 Cnt: 1982
> ages/internetguides/family.html 200 OK text/html 14975
> parent:http://www.oshkosh

Looks like some broken links.  Can you look back through that output and
find where the problem starts?

RFC 2396:

      g) If the resulting buffer string still begins with one or more
         complete path segments of "..", then the reference is
         considered to be in error.  Implementations may handle this
         error by retaining these components in the resolved path (i.e.,
         treating them as part of the final URI), by removing them from
         the resolved path (i.e., discarding relative levels above the
         root), or by avoiding traversal of the reference.
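
To make the second option in that RFC text concrete (discarding the leading
".." segments), here is a minimal Python sketch. It is only an illustration
of the idea, not the spider's own code; the function name and the
example.com URL are made up for the example.

```python
from urllib.parse import urlsplit, urlunsplit

def remove_leading_dots(url):
    """Drop ".." segments that sit at the very start of the URL path.

    Hypothetical helper: it mirrors the RFC 2396 option of discarding
    relative levels above the root and does nothing else to the path.
    """
    parts = urlsplit(url)
    # "/../../a/b.html" splits into ["", "..", "..", "a", "b.html"];
    # skip the empty string produced by the leading slash.
    segments = parts.path.split("/")[1:]
    while segments and segments[0] == "..":
        segments.pop(0)
    return urlunsplit(parts._replace(path="/" + "/".join(segments)))

print(remove_leading_dots("http://example.com/../../pages/family.html"))
```

A ".." that appears later in the path is left alone here, since only the
leading ones are in error according to the RFC.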

Try adding this in your spider config file:


The docs are not clear, and if that doesn't work try


But I actually think the second one is wrong and is a documentation error.
Then, to test, try entering the parent URL of the document where the dots
start appearing.  Let me know if that fixes the problem.  I suppose it
should be on by default.

> This sounds like exactly what I'm looking for. I looked at the sample in
> swish.cgi:
>         Xselect_by_meta  => {
>             method      => 'checkbox_group',
>             columns     => 3,
>             metaname    => 'site',     # Can't be a metaname used elsewhere!
>             values      => [qw/misc mod vhosts other/],
>             labels  => {
>                 misc    => 'General Apache docs',
>                 mod     => 'Apache Modules',
>                 vhosts  => 'Virtual hosts',
>             },
>             description => 'Limit search to these areas: ',
>         },
> Are the values the individual directories? Would I have values =>
> [qw/citydirs etc/],? How do I activate that function? Also, for the
> ExtractPath, where does that go?

That is just a way to limit by metanames.  So if you have

  <meta name="section" content="accounting">
  <meta name="section" content="sales">

Then that puts up a checkbox group for selecting which ones you want to
limit by.  The swish.cgi script then adds to the query (here, if just
"sales" is selected):

    $query AND section=(sales)
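
Roughly, the query gets built like the Python sketch below. This is not
the code in swish.cgi; the function name is made up, and joining several
checked values with OR is my assumption about what a multi-value limit
would look like.

```python
def limit_query(query, metaname, selected):
    """Append a metaname limit to a search query.

    Hypothetical helper: `selected` is the list of checkbox values the
    user ticked. With nothing ticked the query is returned unchanged.
    """
    if not selected:
        return query
    # Assumption: several checked values are combined with OR.
    limit = " OR ".join(selected)
    return f"{query} AND {metaname}=({limit})"

print(limit_query("foo", "section", ["sales"]))
```

With only "sales" checked this gives the same shape as the query above.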

As for where, did you see this?  Let me know if it's not clear:

        # Swish-e's ExtractPath would work well with this.  For example,
        # to allow limiting searches to specific sections of the apache docs use this
        # in your swish-e config file:
        #   ExtractPath site regex !^/usr/local/apache/htdocs/manual/([^/]+)/.+$!$1!
        #   ExtractPathDefault site other
        # which extracts the segment of the path after /manual/ and indexes that name
        # under the metaname "site".  Then searches can be limited to files with that
        # path (e.g. query would be swishdefault=foo AND site=vhosts to limit searches
        # to the virtual host section).

The tricky part about ExtractPath is that it is a substitution regular
expression: the left side has to match for it to be used, and the result
of the substitution is what is stored in the index under the specified
metaname.  So you normally want to match the entire path, ^ ... $:

   ExtractPath site regex !^/usr/local/apache/htdocs/manual/([^/]+)/.+$!$1!

So that tries to match the entire path, capturing the path segment after
"manual", and then replaces the entire string with $1, what was captured.
That $1 path segment is indexed under the metaname "site".
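
If it helps to see the substitution in action, here is a small Python
sketch of the same regular expression. It only imitates what the two
config lines describe, it is not how Swish-e is implemented, and the
function name and sample paths are made up for the example.

```python
import re

# Same pattern as the ExtractPath line; Python writes $1 as \1.
FULL = re.compile(r"^/usr/local/apache/htdocs/manual/([^/]+)/.+$")

def extract_site(path, default="other"):
    """Return the value that would be indexed under "site".

    Falls back to `default` when the pattern does not match, which is
    the role ExtractPathDefault plays in the config.
    """
    result, count = FULL.subn(r"\1", path)
    return result if count else default

doc = "/usr/local/apache/htdocs/manual/vhosts/index.html"
print(extract_site(doc))
print(extract_site("/usr/local/apache/htdocs/manual/index.html"))

# Without ^ ... $ only the matched piece is replaced, and the rest of
# the path stays in the result.
print(re.sub(r"/manual/([^/]+)/", r"\1", doc))
```

The last line shows why anchoring matters: the unanchored pattern leaves
most of the path in place instead of reducing it to just the captured
segment.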

Bill Moseley
Received on Tue Apr 15 15:17:28 2003