RE: duplicate entries in DB after regex performed on URLs?

From: Chad Day <CDay(at)not-real.mindshare.net>
Date: Tue Dec 06 2005 - 21:28:57 GMT

Bill,

Thanks for the reply.  I tried a different solution since my Perl skills
are suspect and time is short.  I set use_cookies => 1 in the
spider.conf file, but then PDF indexing broke.

I can't turn up anything in the archives relating cookies to PDF issues.

A snippet of the output:

$ swish-e -c swish.conf -S prog -v 3
Parsing config file 'swish.conf'
Indexing Data Source: "External-Program"
Indexing "/usr/local/lib/swish-e/spider.pl"
External Program found: /usr/local/lib/swish-e/spider.pl
/usr/local/lib/swish-e/spider.pl: Reading parameters from 'spider.conf'

..

http://dev.website.org/index.php?option=content&task=view&id=5 - Using HTML2 parser -  (39 words)
http://dev.website.org/index.php?option=com_content&task=view&id=4&Itemid= - Using HTML2 parser -  (33 words)
Error: May not be a PDF file (continuing anyway)
Error (0): PDF file is damaged - attempting to reconstruct xref table...
Error: Couldn't find trailer dictionary
Error: Couldn't read xref table
http://dev.website.org/files/Joomla Quick Start 1.0.pdf - Using HTML2 parser -  (no words indexed)
http://dev.website.org/index.php?option=com_content&task=view&id=4&Itemid=9 - Using HTML2 parser -  (33 words)

If I remove the use_cookies => 1 line from my spider.conf, PDF indexing
works fine again, but the PHPSESSID problem comes back.

Why would enabling cookies break PDF indexing?

Thanks,
Chad

-----Original Message-----
From: Bill Moseley [mailto:moseley@hank.org] 
Sent: Tuesday, December 06, 2005 3:46 PM
To: Chad Day
Cc: Multiple recipients of list
Subject: Re: duplicate entries in DB after regex performed on URLs?

On Tue, Dec 06, 2005 at 12:39:18PM -0800, Chad Day wrote:
> I'm using spider.pl to index my Joomla! site, and it's spidering it
> with the PHP session variable (?PHPSESSID=askjhdskljashdk) on the end
> of each URL.  In the process of indexing the site, the session ID
> changes a few times, so multiple copies of the same document get
> indexed.  Also, the link appears in the DB with said session variable
> in it.
> 
> I was able to modify my swish.conf file to remove the PHP session ID
> variables:
> 
>        ReplaceRules regex /\?PHPSESSID.*$//i
>        ReplaceRules regex /&PHPSESSID.*$//i
>        
> but multiple entries still appear in the database for each document.
> What am I doing wrong?

ReplaceRules is just rewriting the file name that it's indexing -- not
telling the spider to only fetch one of the files.

You have to tell the spider how to determine that a url is unique.

This has been discussed quite a bit on the list so you can probably
find info in the archives.

But, IIRC, what you want to do is remove the session id in a test_url
function in your spider config file.

Look at "check_link" in spider.pl -- it calls test_url *then* looks to
see if it has already seen that URL.
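
A minimal sketch of that idea as a spider config fragment. It assumes
test_url is handed a URI object that can be changed in place before the
already-seen check runs, as the check_link description above suggests.
The base_url and email values are placeholders, and the @servers layout
is the usual spider config shape; check the details against your own
copy of spider.pl before relying on it.

```perl
# Hypothetical spider.conf fragment: strip PHPSESSID from each URL
# so the spider treats session-tagged links as the same document.
@servers = (
    {
        base_url => 'http://dev.website.org/',    # placeholder
        email    => 'admin@example.org',          # placeholder

        test_url => sub {
            my ( $uri, $server ) = @_;

            # Rebuild the query string without any PHPSESSID parameter.
            my $query = $uri->query;
            if ( defined $query ) {
                my @keep = grep { !/^PHPSESSID=/i } split /[&;]/, $query;
                $uri->query( @keep ? join( '&', @keep ) : undef );
            }

            return 1;    # true means "go ahead and consider this URL"
        },
    },
);

1;
```

With the session ID removed here, the ReplaceRules lines in swish.conf
should no longer be needed for this purpose, since the URL is already
clean by the time it is fetched and indexed.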

-- 
Bill Moseley
moseley@hank.org

Unsubscribe from or help with the swish-e list: 
   http://swish-e.org/Discussion/

Help with Swish-e:
   http://swish-e.org/current/docs
   swish-e@sunsite.berkeley.edu
Received on Tue Dec 6 13:28:59 2005