On Wed, Sep 05, 2007 at 01:06:43PM -0600, firstname.lastname@example.org wrote:
> I am trying to index Microsoft Word documents, Excel files, and PDFs. I do
> not want to index the content, just the titles.
> I have the following config:
> # Example Swish-e Configuration file
> FileFilter .doc /usr/local/bin/catdoc "-s8859-1 -d8859-1 %p"
> FileFilter .pdf pdftotext "%p -"
> # Define *what* to index
> # IndexDir can point to directories and/or files
> # Here it's pointing to the current directory
> # Swish-e will also recurse into sub-directories.
> IndexDir /opt/samba/CNR
> # But only index the .doc and .pdf files
> IndexOnly .doc .pdf
> # Show basic info while indexing
> IndexReport 1
> Now I know this indexes the content inside the files, but I do not want to
> index the content.
I haven't used this in a while, but you might try NoContents:
IndexContents HTML* .doc .pdf
NoContents .doc .pdf
That probably won't work for the .doc file because catdoc doesn't spit
out HTML (so no <title> to look for). Same for the .pdf file, since
pdftotext as configured above outputs plain text.
What I'd do is find tools to extract the titles from .doc and .pdf
(pdfinfo for .pdf comes to mind) and either generate a simple HTML
file or filter out the title.
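
As a rough sketch of the "generate a simple HTML file" idea for PDFs,
something like the following Python might do. It assumes pdfinfo is on
the PATH and prints its usual "Title:" line. The function names are made
up for illustration, and falling back to the file path when a PDF has no
title is my own choice, not anything Swish-e requires. I have not tested
this against Swish-e itself.

```python
import html
import subprocess


def title_from_pdfinfo(info_text):
    """Return the value of the 'Title:' line in pdfinfo-style output, or None."""
    for line in info_text.splitlines():
        # Split on the first colon only, so titles containing ':' survive.
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Title":
            return value.strip() or None
    return None


def title_html(title):
    """Wrap a title in a minimal HTML document with an empty body."""
    return (
        "<html><head><title>"
        + html.escape(title)
        + "</title></head><body></body></html>\n"
    )


def pdf_title_page(path):
    """Run pdfinfo on a PDF and return a title-only HTML page.

    Hypothetical helper: uses the file path as the title when the PDF
    has no Title entry (an assumption, adjust to taste).
    """
    result = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    )
    title = title_from_pdfinfo(result.stdout) or path
    return title_html(title)
```

A small wrapper script that prints pdf_title_page() for the file it is
given could then stand in for pdftotext in the FileFilter line, with the
.pdf files parsed as HTML so the <title> is picked up. A .doc file would
need a different tool to get at its title.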
Received on Wed Sep 5 18:17:55 2007