On Tue, Dec 06, 2005 at 12:39:18PM -0800, Chad Day wrote:
> I'm using spider.pl to index my Joomla! site, and it's spidering it with the PHP session variable (?PHPSESSID=askjhdskljashdk) on the end of each URL. While the site is being indexed, this session ID changes a few times, so multiple copies of the same document get indexed. Also, the link appears in the DB with said session variable in it.
> I was able to modify my swish.conf file to remove the PHP session ID variables:
> ReplaceRules regex /\?PHPSESSID.*$//i
> ReplaceRules regex /&PHPSESSID.*$//i
> but multiple entries still appear in the database for each document. What am I doing wrong?
ReplaceRules only rewrites the file name that gets stored in the
index -- it doesn't tell the spider to fetch just one copy of each
document.
You have to tell the spider how to determine that a url is unique.
This has been discussed quite a bit on the list so you can probably
find info in the archives.
But, IIRC, what you want to do is remove the session id in a test_url
function in your spider config file.
Look at "check_link" in spider.pl -- it calls test_url *then* checks
whether it has already seen that URL.
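
Something along these lines might do it. This is only a rough sketch,
not tested: it assumes the usual config layout (an entry in the
@servers list) and that test_url is handed a URI object it can modify
in place. The base_url and email values are placeholders, so check the
details against the spider.pl docs.

```perl
# One entry in the @servers list of the spider config file.
{
    base_url => 'http://www.example.com/',   # placeholder
    email    => 'you@example.com',           # placeholder

    # Called for each link before the spider checks whether it has
    # already seen the URL, so normalizing the URI here lets URLs that
    # differ only by session ID count as the same document.
    test_url => sub {
        my ( $uri, $server ) = @_;

        my $query = $uri->query;
        if ( defined $query ) {
            # Keep every query parameter except PHPSESSID.
            my @keep = grep { !/^PHPSESSID=/i } split /[&;]/, $query;

            # Setting the query to undef removes the "?" entirely.
            $uri->query( @keep ? join( '&', @keep ) : undef );
        }

        return 1;    # true = go ahead and spider this URL
    },
},
```

With the session ID stripped there, the ReplaceRules lines shouldn't be
needed for this any more.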
Received on Tue Dec 6 12:46:36 2005