Well, commenting out those lines in swish.conf works. How this is
related to cookies, or exactly why it was causing the PDFs to break, I
don't really know. Perhaps spider.pl converts the PDF to text and then
the config file tries to do it again? The inner workings are beyond
me, so I'll stop hypothesizing.
Thanks for your help. Glad things are indexing as they should be now.
Mindshare Interactive Campaigns, LLC
202.654.0832 - www.mindshare.net
From: Bill Moseley [mailto:email@example.com]
Sent: Wednesday, December 07, 2005 10:20 AM
To: Chad Day
Subject: Re: duplicate entries in DB after regex performed on URLs?
On Wed, Dec 07, 2005 at 10:12:09AM -0500, Chad Day wrote:
> $ cat swish.conf
> # Example configuration file
> SwishProgParameters spider.conf
> #ReplaceRules regex /\?PHPSESSID.*$//i
> #ReplaceRules regex /&PHPSESSID.*$//i
> # Tell Swish-e what to index (same as -i switch above)
> IndexDir /usr/local/lib/swish-e/spider.pl
> IndexFile /usr/local/apache/htdocs/website.index
> IndexOnly .php .txt .html .htm .pdf .xml .htm .shtml
IndexOnly has no effect here.
> # Index the PDF files
> FileFilter .pdf /usr/X11R6/bin/pdftotext '"%p" -'
spider.pl is already filtering, so this will attempt to filter
already-filtered content.
> # Tell Swish-e that .txt files are to use the text parser.
> IndexContents TXT* .txt .pdf
> IndexContents XML* .xml
> IndexContents HTML* .htm .html .shtml .php
The spider should tell swish what type the content is, so
IndexContents should not be needed.
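Putting the comments above together, a trimmed swish.conf might look
roughly like the sketch below. It is assembled only from the quoted
config and is untested; it assumes spider.pl is run through the prog
input method (-S prog) and that spider.conf leaves PDF filtering to the
spider.

```
# Sketch only: quoted config with the redundant directives removed.
SwishProgParameters spider.conf

# Left commented out, as in the quoted config.
#ReplaceRules regex /\?PHPSESSID.*$//i
#ReplaceRules regex /&PHPSESSID.*$//i

# The program to run and the index to write (paths as quoted).
IndexDir /usr/local/lib/swish-e/spider.pl
IndexFile /usr/local/apache/htdocs/website.index

# Removed: IndexOnly (no effect here), FileFilter (spider.pl already
# filters), IndexContents (the spider reports the content type).
```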
It will take a few hours before I can look at the "cookie" issue.
Received on Wed Dec 7 07:27:06 2005