RE: Hit-highlighting of PDF files

From: Herman Knoops <hk.sw(at)not-real.knoman.com>
Date: Wed Jun 28 2006 - 10:10:49 GMT
>
> Has anyone tried to perform hit-highlighting of PDF files using Swish-E?
>

We have PDF hit-highlighting working with Swish-E as the lower-level
search engine. However, we did not modify Swish-E for this, because
Swish-E and Acrobat Reader treat characters and words differently.

The trick to making PDF hit-highlighting work accurately is to always
look at characters and words the way Acrobat does. This means you cannot
use the normal PDF2TXT filtering of Swish-E. You need a special filter
that creates the TXT file plus a second file (e.g. .LST) holding the
page and character offset of every single word. You then use the
TXT file for Swish-E indexing, so that Swish-E and Acrobat see words and
characters in an identical way.
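
To make the two-file idea concrete, here is a minimal sketch in Python. It
is not our actual filter. It assumes the words have already been extracted
per page in Acrobat's segmentation (that extraction is the hard part and is
not shown), and the name build_txt_and_lst, the (word, page, offset) record
layout, and the way offsets are counted are all hypothetical choices made
for illustration.

```python
from typing import List, Tuple

# One LST record: (word, zero-based page number, character offset in page).
LstEntry = Tuple[str, int, int]


def build_txt_and_lst(pages: List[List[str]]) -> Tuple[str, List[LstEntry]]:
    """Build the text to index and the per-word offset list.

    `pages` is assumed to hold, per page, the words exactly as Acrobat
    would segment them. Offsets here are positions in the space-joined
    page text; a real filter must count characters the way Acrobat does.
    """
    txt_lines: List[str] = []
    lst: List[LstEntry] = []
    for page_no, words in enumerate(pages):
        offset = 0  # offsets restart at the beginning of every page
        for word in words:
            lst.append((word, page_no, offset))
            offset += len(word) + 1  # the word plus one separating space
        txt_lines.append(" ".join(words))
    return "\n".join(txt_lines), lst


pages = [["Hit", "highlighting"], ["of", "PDF", "files"]]
txt, lst = build_txt_and_lst(pages)
print(txt)
for entry in lst:
    print(entry)
```

The txt string is what would be handed to Swish-E for indexing; the lst
records would be written to the .LST file next to it.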

> What I am referring to is the creation of a "pseudo-xml" file (as
> specified by Adobe ...

> So is there a way to query the index file to obtain the list of all
> occurrences of a word in a file (like a regular search), but also
> obtain the
> character offsets of each of those occurrences from the beginning of the
> page in that document?

You can use the second file (.LST) to produce the "pseudo-xml" file
that Acrobat understands; it also enables the "Previous Highlight" and
"Next Highlight" buttons (jumping from page to page that has a hit).
You will probably have to create a building block that produces the
XML file from the LST file and the search criteria.
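
Such a building block could look roughly like the sketch below. It is an
illustration, not our implementation. The function name highlight_xml and
the (word, page, offset) records are hypothetical, and the element and
attribute names (Body, Highlight, loc, pg, pos, len) are written from
memory of Adobe's highlight file format, so check them against Adobe's
specification before relying on them.

```python
from typing import Iterable, List, Tuple

# One LST record: (word, zero-based page number, character offset in page).
LstEntry = Tuple[str, int, int]


def highlight_xml(lst: List[LstEntry], terms: Iterable[str]) -> str:
    """Return a highlight file listing every LST word that matches a term.

    Matching is a plain case-insensitive comparison; a real version would
    have to apply the same stemming and wildcard rules as the search engine.
    """
    wanted = {t.lower() for t in terms}
    locs = [
        f"<loc pg={page} pos={offset} len={len(word)}>"
        for word, page, offset in lst
        if word.lower() in wanted
    ]
    return "\n".join(
        [
            "<XML>",
            # Attribute values below are assumptions, not verified defaults.
            "<Body units=characters color=#ff00ff mode=active version=2>",
            "<Highlight>",
            *locs,
            "</Highlight>",
            "</Body>",
            "</XML>",
        ]
    )


lst = [("Hit", 0, 0), ("highlighting", 0, 4), ("of", 1, 0), ("PDF", 1, 3), ("files", 1, 7)]
xml = highlight_xml(lst, ["pdf", "hit"])
print(xml)
```

Because the function only needs the LST records and the query terms, it
does not depend on which search engine produced the hit.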

> Customizing the code is an option, but I'm hoping someone else has done
> something similar that I could get inspired from.
>
That is an option, but we decided to decouple PDF hit-highlighting from
any search engine, so we have more flexibility in switching search
engines (and can even make PDF hit-highlighting possible with Google,
Yahoo, MSN, ...).

Herman Knoops
KnoMan.com
Received on Wed Jun 28 03:10:57 2006