On Fri, 28 Mar 2003, Yang Yang wrote:
> hi there:
> I'm trying to install swish-e-2.2.3 on my machine,
> I tried the swish.cgi under example/, but after I put in a query, it
> says "Service unavailable", what might be the error?
Did you look at your web server's error log?
Debug CGI scripts from the command line before testing on the web server.
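
A rough sketch of what "from the command line" means here: set the
environment variables the web server would set for a GET request, then run
the script directly so any error message shows up in your terminal. Python
is used only for illustration. The helper names, the ./swish.cgi path and
the "query" parameter name are assumptions; check them against your copy of
the script.

```python
import os
import subprocess
from urllib.parse import urlencode


def build_cgi_env(params, script_name="/cgi-bin/swish.cgi"):
    """Return a copy of the environment with basic CGI GET variables added."""
    env = dict(os.environ)
    env.update({
        "GATEWAY_INTERFACE": "CGI/1.1",
        "REQUEST_METHOD": "GET",
        "QUERY_STRING": urlencode(params),  # e.g. query=foo+bar
        "SCRIPT_NAME": script_name,         # hypothetical URL path
        "SERVER_NAME": "localhost",
        "SERVER_PORT": "80",
    })
    return env


def run_cgi(script_path, params):
    """Run the script outside the web server; return (exit code, stdout, stderr)."""
    proc = subprocess.run([script_path], env=build_cgi_env(params),
                          capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr


# Example call (assumed path and parameter name):
#   code, out, err = run_cgi("./swish.cgi", {"query": "test"})
#   print(code, err or out)
```

Whatever lands on stderr is typically what the web server would have
written to its error log.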
> I also don't know how to specify the index file to be used for
> swish.cgi, what is the format of the config file? I couldn't find
> documentation on this.
All that's hidden away in an obscure section of the documentation called:
> btw, does anybody know a good tool to extract title and author
> information from PostScript, pdf files?
1) In the filter-bin directory there's a file called _pdf2html.pl that can
be used with the xpdf package (which includes the pdftotext and pdfinfo
programs)
that will do that. The data in the info section of the pdf file is
returned in HTML <meta> tags.
This is the easiest method to use if indexing a few documents.
2) In the prog-bin directory there's a file called pdf2html that works the
same way, but is designed to be used with a -S prog program. The default
SwishSpiderConfig.pl spider.pl configuration file contains examples how to
This is a good method to use if using -S prog to fetch your documents and
you will only be filtering, say, just pdf files.
3) In the filters directory there's a module (./filters/SWISH/Filters/Pdf2HTML.pm)
that also does the same thing, but is for use with the SWISH::Filter
module. It's a different way to filter docs and can be used by -S prog
programs (SwishSpiderConfig.pl also contains an example of how to use the
SWISH::Filter method) and can also be used by the -S http method
(swishspider in the "src" will use SWISH::Filter method if the modules
This method requires installing the Perl modules (or just setting @INC
or PERL5LIB environment variable). It uses filter "plug-ins" and is
designed for indexing a number of different types of documents.
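
If all you want is the title and author and not a full indexing setup, the
pdfinfo program from xpdf prints the PDF's info section as "Key: value"
lines, and that is easy to parse yourself. A minimal sketch, in Python for
illustration only; the function names are made up here and are not part of
swish-e or xpdf:

```python
import subprocess


def parse_pdfinfo(text):
    """Turn pdfinfo's 'Key: value' lines into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as dates stay intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def pdf_title_author(path):
    """Run pdfinfo on one file and return (title, author); either may be None."""
    out = subprocess.run(["pdfinfo", path], capture_output=True,
                         text=True, check=True).stdout
    info = parse_pdfinfo(out)
    return info.get("Title"), info.get("Author")
```

Many PDFs have an empty or missing info section, in which case both values
come back as None.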
> I tried one tool: harvest,
> the postscript cannot be extracted correctly (always the first line
> of text , not the title is extracted), I'm also very interested in
> tools that are capable of extracting bitmapped postscript/pdf files.
> I think google does this pretty good
Check out the xpdf package and see if it can extract the text.
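
One quick way to check a file, sketched in Python for illustration: run
pdftotext with "-" as the output file so the text goes to stdout, and see
whether anything other than whitespace comes back. The function names are
hypothetical, and the assumption that a bitmapped (scanned) PDF yields
essentially empty output is a rule of thumb, not a guarantee.

```python
import subprocess


def looks_like_scanned(text):
    """True if the extracted text holds nothing but whitespace or form feeds."""
    return text.strip() == ""


def pdf_text(path):
    """Return the text pdftotext extracts from the file ('-' means stdout)."""
    result = subprocess.run(["pdftotext", path, "-"], capture_output=True,
                            text=True, check=True)
    return result.stdout


# Example call (assumed file name):
#   if looks_like_scanned(pdf_text("paper.pdf")):
#       print("no text layer found; this file would need OCR")
```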
> I'm looking for these tools(including swish ) mainly to manage my
> collection of papers in PS and pdf files
Ghostscript has ps2ascii as well as ps2pdf.
Bill Moseley email@example.com
Received on Fri Mar 28 13:36:50 2003