Re: [swish-e] Index Doc , excel , pdf Titles Only

From: Bill Moseley <moseley(at)>
Date: Wed Sep 05 2007 - 22:17:55 GMT
On Wed, Sep 05, 2007 at 01:06:43PM -0600, wrote:
> I am trying to index Microsoft Word documents, Excel files and PDFs. I do
> not want to index the content, just the titles.
> I have the following config
>   # Example Swish-e Configuration file
> FileFilter .doc       /usr/local/bin/catdoc "-s8859-1 -d8859-1 %p"
> FileFilter .pdf       pdftotext   "%p -"
>     # Define *what* to index
>     # IndexDir can point to a directories and/or a files
>     # Here it's pointing to the current directory
>     # Swish-e will also recurse into sub-directories.
>     IndexDir /opt/samba/CNR
>     # But only index the .doc and .pdf files
>     IndexOnly .doc .pdf
>     # Show basic info while indexing
>     IndexReport 1
> Now I know this indexes the content inside the files, but I do not want to
> index the content,

I haven't used this in a while, but you might try NoContents:

    IndexContents HTML* .doc .pdf
    NoContents .doc .pdf

That probably won't work for the .doc file because catdoc doesn't spit
out HTML (so no <title> to look for). The same goes for the .pdf file,
since pdftotext outputs plain text.

What I'd do is find tools to extract the titles from .doc and .pdf
(pdfinfo for .pdf comes to mind) and either generate a simple HTML
file or have the filter output just the title.
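
As a rough, untested sketch of the pdfinfo idea: the script below runs
pdfinfo on a PDF, picks out the "Title:" line, and prints a minimal HTML
page with that title in <title>. The script name (pdf_title.py), the
function names, and the fallback to the file path when the PDF has no
title are assumptions made for illustration, not anything swish-e
provides.

```python
#!/usr/bin/env python3
"""Sketch: print a minimal HTML page whose <title> is a PDF's Title field."""
import html
import subprocess
import sys


def parse_title(pdfinfo_output, fallback=""):
    """Return the value of the 'Title:' line in pdfinfo output, or fallback."""
    for line in pdfinfo_output.splitlines():
        # Split on the first colon only, so titles containing ':' survive.
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Title":
            return value.strip() or fallback
    return fallback


def build_html(title):
    """Wrap the title in a tiny HTML document so an HTML parser finds <title>."""
    return "<html><head><title>{}</title></head><body></body></html>".format(
        html.escape(title)
    )


def main(path):
    # pdfinfo prints metadata as "Key: value" lines on stdout.
    result = subprocess.run(
        ["pdfinfo", path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    if result.returncode != 0:
        sys.stderr.write(result.stderr)
        return result.returncode
    # Assumption: fall back to the file path when there is no Title metadata.
    print(build_html(parse_title(result.stdout, fallback=path)))
    return 0


# Only run as a filter when a file path was actually passed in.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1]))
```

If that works, it could presumably be hooked in the same way as the
existing filters, with the install path being hypothetical:

    FileFilter .pdf /usr/local/bin/pdf_title.py "%p"
    IndexContents HTML* .pdf

A similar wrapper would be needed for .doc files, using whatever tool
can read the title from them.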

Bill Moseley

Received on Wed Sep 5 18:17:55 2007