RE: duplicate entries in DB after regex performed on URLs?

From: Chad Day <CDay(at)not-real.mindshare.net>
Date: Wed Dec 07 2005 - 15:21:53 GMT
A lot of that is left over from when I was using the http method to
spider the site.  I'll make those changes and see if it bizarrely
affects the cookie issue (wouldn't surprise me).

Thanks for the information.

Chad 

-----Original Message-----
From: Bill Moseley [mailto:moseley@hank.org] 
Sent: Wednesday, December 07, 2005 10:20 AM
To: Chad Day
Cc: swish-e@sunsite3.berkeley.edu
Subject: Re: duplicate entries in DB after regex performed on URLs?

On Wed, Dec 07, 2005 at 10:12:09AM -0500, Chad Day wrote:
> swish.conf:
> 
> $ cat swish.conf 
> # Example configuration file
> 
> SwishProgParameters spider.conf
> #ReplaceRules regex /\?PHPSESSID.*$//i
> #ReplaceRules regex /&PHPSESSID.*$//i
>  
> # Tell Swish-e what to index (same as -i switch above)
> IndexDir /usr/local/lib/swish-e/spider.pl
> IndexFile /usr/local/apache/htdocs/website.index 
> IndexOnly .php .txt .html .htm .pdf .xml .htm .shtml

IndexOnly has no effect here. It applies only to the file system
input method, and you are indexing through the spider program.


> 
> # Index the PDF files
> FileFilter .pdf /usr/X11R6/bin/pdftotext '"%p" -'

spider.pl is already filtering, so this FileFilter will attempt to
filter content that has already been filtered.


> 
> # Tell Swish-e that .txt files are to use the text parser.
> IndexContents TXT* .txt .pdf
> IndexContents XML* .xml
> IndexContents HTML* .htm .html .shtml .php 

The spider should tell swish what type the content is, so
IndexContents should not be needed.
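
Taken together, the three comments above point to a trimmed swish.conf
along the following lines. This is only a sketch, not tested against
this site: the paths and the spider.conf name are copied from the
quoted config, the ReplaceRules lines are left commented out as in the
original, and it assumes spider.pl really is doing the filtering and
reporting the content type for every document it returns.

```
# swish.conf, trimmed (sketch; assumes indexing with the prog input method)

# Parameters passed to the spider program
SwishProgParameters spider.conf
#ReplaceRules regex /\?PHPSESSID.*$//i
#ReplaceRules regex /&PHPSESSID.*$//i

# The program to run, and where to write the index
IndexDir /usr/local/lib/swish-e/spider.pl
IndexFile /usr/local/apache/htdocs/website.index

# Dropped: IndexOnly (no effect here), FileFilter (spider.pl already
# filters), IndexContents (the spider reports the content type)
```

With the prog input method this would typically be run as
"swish-e -S prog -c swish.conf".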

It will take a few hours before I can look at the "cookie" issue.

-- 
Bill Moseley
moseley@hank.org

Received on Wed Dec 7 07:21:54 2005