Re: Unable to retrieve documents

From: Kaplan, Andrew H. <AHKAPLAN(at)not-real.PARTNERS.ORG>
Date: Wed Jan 07 2004 - 14:55:06 GMT
I modified the swish.cgi and swish.conf files and I have made some progress.

The links no longer have the NULL statement. However, the files are still 
inaccessible. When I check the URL for the file, it indicates the file is 
in the cgi-bin directory when in reality it is in the documentation directory.
The swish.cgi file is located in the cgi-bin directory, and the swish.conf 
file is in the documentation directory.

When I created the index, I was in the documentation directory, and the
command that was used was the following: /usr/local/bin/swish-e -c swish.conf -v 3.

I've included the two files in this e-mail.

The 'spaces' that I mentioned in the previous e-mail refer to the filenames.
For example, one file that has been indexed is:

	Windows Workstation Environment Variables for IDL.pdf

-----Original Message-----
On Behalf Of Bill Moseley
Sent: Tuesday, January 06, 2004 4:55 PM
To: Multiple recipients of list
Subject: [SWISH-E] Re: Unable to retrieve documents

On Tue, Jan 06, 2004 at 10:17:06AM -0800, Kaplan, Andrew H. wrote:
> I have set up our webserver such that the swish.cgi page comes up when
> a person wants to retrieve a document.  When the text is entered the
> results screen does appear with the appropriate links to the documents
> in question.  However, users are unable to access the documents.

Seems like if they can't be accessed then they are not appropriate links.

> The results screen does show the names of the files with their extensions,
> pdf, doc, etc. Immediately under the files the word NULL appears in
> parentheses.

That NULL is in the FAQ.  See the swish.cgi docs.

> The information about the file including its modification date, size,
> and path also appears. Clicking on the file causes the error screen
> 	Not Found -- The requested url was not found on this server
> to appear.

Well, that's just a web server issue -- you have to make sure the paths
point to the right locations.

You can rewrite the path when indexing (in the swish-e config file)
with ReplaceRules, and you can also prepend text to each path with a
setting in the swish.cgi config file.
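
As a rough illustration of what such a rewrite has to achieve, here is a
small Python sketch. It is not swish-e or swish.cgi code, and the directory
and URL prefix are made-up values, not taken from this thread. It maps an
indexed file path to a URL that the web server can actually serve:

```python
from urllib.parse import quote

# Hypothetical values for illustration only; neither comes from the thread.
INDEX_ROOT = "/var/www/html/documentation/"
URL_PREFIX = "/documentation/"


def path_to_url(indexed_path: str) -> str:
    """Swap the filesystem prefix for a URL prefix and percent-encode the rest."""
    if not indexed_path.startswith(INDEX_ROOT):
        raise ValueError(f"{indexed_path!r} is outside {INDEX_ROOT!r}")
    relative = indexed_path[len(INDEX_ROOT):]
    # quote() leaves "/" alone by default and turns each space into %20.
    return URL_PREFIX + quote(relative)
```

If the link in the results page ends up under cgi-bin instead, the prefix
being applied is wrong or missing, which is the kind of thing the path
rewrite is meant to correct.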

> The files that are being indexed are either Adobe pdf, MS-Word doc,
> xls, and htm documents. They all have 
> spaces between the words in their titles. The server itself has the
> xls2csv, and xpdf programs installed. 

Spaces between the words in their "titles"?  Or do you mean file names?  I
suspect you mean file names.  You don't give much detail so I can't know
for sure, but here is an example of indexing a file with a space in its name:

Notice that the href is correct:

moseley@bumby:~/apache$ echo "hello" >  "file with space.txt"

moseley@bumby:~/apache$ swish-e -i "file with space.txt" -v0

moseley@bumby:~/apache$ GET http://localhost/apache/swish.cgi?query=hello | grep txt
        <dt>1 <a href="file%20with%20space.txt">file with space.txt</a>
<small>-- rank: <b>1000</b></small></dt>
<tr><td><small>Document Path:</small></td><td><small> <b>file with
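
The %20 in the href above is ordinary URL percent-encoding of the space
character. A minimal Python sketch (an illustration only, not part of
swish.cgi) that reproduces the encoding and reverses it:

```python
from urllib.parse import quote, unquote

name = "file with space.txt"

# Percent-encode the file name for use in an href.
href = quote(name)
print(href)

# Decoding the href gives back the original file name.
print(unquote(href) == name)
```

So a file name with spaces is not a problem in itself, provided the link is
encoded and the path in front of it points at the right directory.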

> What do I need to do to correct this problem? Thanks.

Something like the above few lines that demonstrate the problem.

Here's another example with spidering:

moseley@bumby:~/apache$ cp test.pdf "test pdf with spaces.pdf"

moseley@bumby:~/apache$ /usr/local/lib/swish-e/ default
http://localhost/apache/test%20pdf%20with%20spaces.pdf | swish-e -S prog -i
stdin -v0
/usr/local/lib/swish-e/ Reading parameters from 'default'

Summary for: http://localhost/apache/test%20pdf%20with%20spaces.pdf
Total Bytes: 12,593  (12593.0/sec)
 Total Docs:      1  (1.0/sec)
Unique URLs:      1  (1.0/sec)

moseley@bumby:~/apache$ GET http://localhost/apache/swish.cgi?query=the | grep pdf
        <dt>1 <a
st/apache/test pdf with spaces.pdf</a> <small>-- rank:
<tr><td><small>Document Path:</small></td><td><small>
<b>http://localhost/apache/test pdf with spaces.pdf</b></small></td></tr>

Bill Moseley

Due to deletion of content types excluded from this list by policy,
this multipart message was reduced to a single part, and from there
to a plain text message.
Received on Wed Jan 7 14:55:20 2004