On Fri, Jul 14, 2006 at 04:32:14PM +0100, Luke Simmons wrote:
> [root@tiger archive]# /usr/local/lib/swish-e/DirTree.pl edjanfeb06.pdf | grep title
>
> <title>Jan Feb 06</title>
> <meta name="title" content="Jan Feb 06">
>
> But without a filter it appears to not be parsing the html output
> from the pdf to the index. So after an index it doesn't show anything
> up in the search (cgi) including the title.
>
> Do I need to add pdf2HTML as a file filter in the config? And also
> make the changes that Peter Karman suggested? (thanks Peter)
>
> FileFilter .pdf /usr/local/lib/swish-e/perl/SWISH/Filters/Pdf2HTML.pm
> # Does this or anything need to go here?
No, again, as you can see from the output of DirTree.pl it's
producing *html*, so you don't want to tell swish to convert it to
html again. It's already been converted.
> DefaultContents HTML*
> StoreDescription HTML* <body> 200000
So, if you run swish-e from the command line, is the body stored in
the swishdescription property?
> Am I right to believe that when indexing the process pulls the PDF
> apart and each part is HTML tagged up (i.e. title > <title></title>
> and the text snippet to <body></body>)?
Not indexing (swish-e), but DirTree.pl does that (by using
SWISH::Filter). Look at DirTree.pl's output.
> Is the process then not putting the HTML into the index?
Should be. You can use -T indexed_words properties to see what's
ending up in the index while indexing.
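
For reference, a minimal sketch of what that setup might look like,
assuming DirTree.pl is run through the "prog" input method. The config
file name (swish.conf) and the archive directory (/path/to/archive) are
placeholders, not taken from the original setup; the two HTML* lines
are the ones quoted above. Note there is no FileFilter line for .pdf,
since DirTree.pl already hands swish-e HTML.

```
# swish.conf (hypothetical file name)

# With -S prog, IndexDir names the program that feeds documents to swish-e.
IndexDir /usr/local/lib/swish-e/DirTree.pl

# Arguments passed to that program; placeholder path, adjust to your archive.
SwishProgParameters /path/to/archive

# Parse documents as HTML and keep the body text as the description.
DefaultContents HTML*
StoreDescription HTML* <body> 200000

# Index with tracing to see the words and properties going in:
#   swish-e -c swish.conf -S prog -T indexed_words properties
#
# Then search from the command line and print the stored title and description:
#   swish-e -w someword -x '<swishtitle>: <swishdescription>\n'
```

If the title and description show up here but not in the CGI, the
problem is more likely on the search side than in the indexing.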
> I added the old FileFilter of pdftotext in and this runs ok just
> without the title attribute working.
That shouldn't work. pdftotext isn't very good at converting html to
text.
--
Bill Moseley
moseley@hank.org
Received on Fri Jul 14 09:44:59 2006