Re: [swish-e] XML parsing not returning Title

From: Robinson Craig <Craig.Robinson(at)>
Date: Mon Dec 03 2007 - 23:57:59 GMT
>On 11/28/2007 08:18 AM, Peter Karman wrote:
>> On 11/27/2007 04:56 PM, Robinson Craig wrote:
>>> I've run the same config and files with 2.4.5 (current stable)
>>> installed on a DEV box (in readiness for deployment out to PROD),
>>> with the same result (incidentally with no parsing errors).
>> what indexing method (-S) are you using? Can you paste the exact
>> command you are using to index?
>nevermind. I see the issue.
>You need to add:
> PropertyNameAlias swishtitle title
>to your config.
>The HTML parser knows about the special '<title>' tagset and uses that
>for the swishtitle property. The XML parser doesn't know about it. Since
>you are indexing .pdf files with the HTML parser (is that what you really
>want?), the .pdf docs get the title magic, but the .html docs (or anything
>else parsed with the XML parser) need a little help knowing which tag to
>use as the
>Peter Karman  .  peter(at)  .

Hi Peter,

I did try "PropertyNameAlias swishtitle title", and it does something
unexpected. The HTML page (parsed using XML2) now returns the title
beautifully, but the PDF file (parsed using HTML2) now puts the content
of the PDF in the "title" field (see the attached text file).
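
The setup described above would correspond to config lines roughly like
these. This is a sketch, not the actual config: the file extensions are
assumptions, and only the parser assignments and the alias come from the
description.

```
# Sketch only; the extension-to-parser mapping is assumed
IndexContents XML2 .html
IndexContents HTML2 .pdf
# The alias suggested by Peter
PropertyNameAlias swishtitle title
```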

However, what really does interest me is the comment: ">indexing .pdf
files with the HTML parser (is that what you really want?)". I am
thinking that my approach is somewhat non-standard :-).

What we are doing is converting the PDF to HTML by using "pdftotext
-htmlmeta" (part of Xpdf) and then indexing using HTML2. From what I
have gleaned from around the place, this seems to be one way of doing
it. What would be the alternative? Or, better still, the "Standard" way
for indexing PDF metadata as well as content? I have been reading (in
the forum) about how SWISH::Filters::Pdf2HTML uses 'pdfinfo' (also part
of Xpdf). I haven't really investigated using (which uses
Pdf2HTML by default) yet as I am trying to do this from the file system,
but would that be considered the more standard approach?
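
As a sketch of the conversion step described above, assuming it is
hooked in through FileFilter rather than run as a separate pass
beforehand (which of the two is actually in use is not stated here), the
config lines might look like:

```
# Assumption: pdftotext is on the PATH and is invoked through FileFilter
# %p is the path of the file being filtered; "-" sends the output to stdout
FileFilter .pdf pdftotext "-htmlmeta '%p' -"
IndexContents HTML2 .pdf
```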

Thanks for your assistance.

Cheers, Craig


