RE: duplicate entries in DB after regex performed on URLs?

From: Chad Day <CDay(at)not-real.mindshare.net>
Date: Wed Dec 07 2005 - 15:27:06 GMT
Well, commenting out those lines in swish.conf works.  How this is
related to cookies, or exactly why it was causing the PDFs to break, I
don't really know.  Perhaps spider.pl converts the PDF to text and then
the config file tries to do it again?  The inner workings of it are
beyond me, so I'll stop hypothesizing.

Thanks for your help.  Glad things are indexing as they should be now.

Chad Day
Developer
Mindshare Interactive Campaigns, LLC
202.654.0832 - www.mindshare.net 

-----Original Message-----
From: Bill Moseley [mailto:moseley@hank.org] 
Sent: Wednesday, December 07, 2005 10:20 AM
To: Chad Day
Cc: swish-e@sunsite3.berkeley.edu
Subject: Re: duplicate entries in DB after regex performed on URLs?

On Wed, Dec 07, 2005 at 10:12:09AM -0500, Chad Day wrote:
> swish.conf:
> 
> $ cat swish.conf 
> # Example configuration file
> 
> SwishProgParameters spider.conf
> #ReplaceRules regex /\?PHPSESSID.*$//i
> #ReplaceRules regex /&PHPSESSID.*$//i
>  
> # Tell Swish-e what to index (same as -i switch above)
> IndexDir /usr/local/lib/swish-e/spider.pl
> IndexFile /usr/local/apache/htdocs/website.index 
> IndexOnly .php .txt .html .htm .pdf .xml .htm .shtml

IndexOnly has no effect here: it applies only to the file system
access method, not to documents supplied by spider.pl.


> 
> # Index the PDF files
> FileFilter .pdf /usr/X11R6/bin/pdftotext '"%p" -'

spider.pl is already filtering the PDFs, so this will attempt to
filter content that has already been filtered.


> 
> # Tell Swish-e that .txt files are to use the text parser.
> IndexContents TXT* .txt .pdf
> IndexContents XML* .xml
> IndexContents HTML* .htm .html .shtml .php 

The spider should tell swish what type the content is, so
IndexContents should not be needed.
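
Taken together, the comments above suggest a trimmed configuration
along the following lines. This is only a sketch assembled from the
config quoted in this message, with the directives flagged above
removed; the paths are the ones from the original post, and the
ReplaceRules lines are left commented out exactly as they were posted.
It has not been tested against this site.

```
# Sketch of a trimmed swish.conf for indexing via spider.pl.
# Assumption: spider.pl does the filtering and reports content types,
# so IndexOnly, FileFilter and IndexContents are dropped.

SwishProgParameters spider.conf
#ReplaceRules regex /\?PHPSESSID.*$//i
#ReplaceRules regex /&PHPSESSID.*$//i

# Program that feeds documents to Swish-e, and the index to write
IndexDir /usr/local/lib/swish-e/spider.pl
IndexFile /usr/local/apache/htdocs/website.index
```

With a file like this, indexing would be run with the prog access
method, for example "swish-e -c swish.conf -S prog".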

It will take a few hours before I can look at the "cookie" issue.

-- 
Bill Moseley
moseley@hank.org

Received on Wed Dec 7 07:27:06 2005