On Sat, Aug 05, 2006 at 03:01:03PM -0700, Eric Jobidon wrote:
> I guess I could grep the output of xpdf and replace "&" with "&amp;", but I
> am thinking there is a much simpler answer to this, either as a config
> directive in .xpdfrc (replacing "&" with "&amp;") or in the swish-e config
> file (maybe treating "&" as a stop word). Any thoughts? Anyone had the same
> issue? How would you fix it?
See if xpdf can't produce correct html?
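If xpdf can't be made to emit valid html, a small filter between xpdf and
swish-e is one fallback. This is only a rough sketch in Python: the function
name is made up for illustration, and it assumes bare "&" characters are the
only thing wrong with the markup. It leaves existing entities such as
"&amp;" or "&#38;" untouched.

```python
import re

# Match "&" only when it does NOT already start a named or numeric entity
# (for example &amp; &#38; or &#x26;).
BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def escape_bare_ampersands(html: str) -> str:
    """Hypothetical helper: replace each bare "&" with "&amp;"."""
    return BARE_AMP.sub("&amp;", html)
```

The filter would sit wherever the xpdf output is handed to the indexer, so
the parser never sees an unescaped ampersand.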
Is the error:
error: htmlParseEntityRef: no name
That's from libxml2 and setting this in your swish config might
> 2- (This question pertains to a different environment, on a unix box) I want
> to offer syntax highlighting in the search results page. All the indexed
> docs are PDF files. And they are all fairly large PDF files (some are over
> 100MB in size, and splitting them up is not an option).
Splitting them up while indexing is probably your best, if not only,
option. Then your search results could be, for example, targeted to
a specific page or chapter of the pdf file, assuming you can figure
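
One way to split while indexing is to extract the text a page at a time
rather than cutting up the PDF itself. A rough sketch, assuming xpdf's
pdfinfo and pdftotext are on the PATH; the helper names are invented for
illustration, and how each page is then fed to swish-e is left open:

```python
import re
import subprocess


def page_count(pdfinfo_output: str) -> int:
    """Read the page total from the "Pages:" line that pdfinfo prints."""
    match = re.search(r"^Pages:\s+(\d+)", pdfinfo_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Pages:' line in pdfinfo output")
    return int(match.group(1))


def page_command(pdf_path: str, page: int) -> list:
    """Build a pdftotext call limited to one page (-f first, -l last)."""
    return ["pdftotext", "-f", str(page), "-l", str(page), pdf_path, "-"]


def extract_pages(pdf_path: str):
    """Yield (page_number, text) pairs, one per page of the PDF."""
    info = subprocess.run(
        ["pdfinfo", pdf_path], capture_output=True, text=True, check=True
    ).stdout
    for page in range(1, page_count(info) + 1):
        text = subprocess.run(
            page_command(pdf_path, page),
            capture_output=True, text=True, check=True,
        ).stdout
        yield page, text
```

Each pair could then be indexed as its own document, so a hit points at a
page rather than at a 100MB file.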
> One avenue I am considering is to save the output from xpdf in a html file
> and indexing only that html file (ignoring the PDF altogether for the
> indexing). On a search, the cgi script would then parse the html file,
> highlight the content and display it to the user. The displayed document URL
> could then simply be transformed from ".html" to ".PDF". Think this would
> work? Any other avenue I should explore?
Typically, the content gets stored as a property in the swish index
and that's what is displayed for highlighting. So, what the source
document is doesn't really matter.
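
For illustration only, here is roughly what highlighting against stored text
looks like. This is a Python sketch, not the code swish-e ships; the function
name and the "<b>" tags are placeholders, and it assumes the stored property
comes back to the script as plain text.

```python
import html
import re


def highlight(text, terms, open_tag="<b>", close_tag="</b>"):
    """Wrap whole-word, case-insensitive matches of terms in tags.

    Everything taken from the stored text is HTML-escaped, so a stray "&"
    or "<" in the content cannot break the results page.
    """
    words = [re.escape(term) for term in terms if term]
    if not words:
        return html.escape(text)
    pattern = re.compile(r"\b(" + "|".join(words) + r")\b", re.IGNORECASE)
    parts = []
    last = 0
    for match in pattern.finditer(text):
        # Escape the text before the hit, then the hit itself inside the tags.
        parts.append(html.escape(text[last:match.start()]))
        parts.append(open_tag + html.escape(match.group(0)) + close_tag)
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)
```

The script only ever touches the stored text, so it never has to open the
PDF or an intermediate html file at search time.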
Received on Sat Aug 5 21:14:28 2006