Re: duplicate entries in DB after regex performed on URLs?

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Dec 06 2005 - 21:57:34 GMT

On Tue, Dec 06, 2005 at 04:30:41PM -0500, Chad Day wrote:
> http://dev.website.org/index.php?option=content&task=view&id=5 - Using
> HTML2 parser -  (39 words)
> http://dev.website.org/index.php?option=com_content&task=view&id=4&Itemid=
> - Using HTML2 parser -  (33 words)
> Error: May not be a PDF file (continuing anyway)
> Error (0): PDF file is damaged - attempting to reconstruct xref table...
> Error: Couldn't find trailer dictionary
> Error: Couldn't read xref table

That just looks like a broken PDF.  Did you check?
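
One quick way to check is to save whatever the server returns for that
URL and look at the raw bytes. The sketch below is only a rough sanity
test, not a PDF validator and not part of Swish-e; the function name
looks_like_pdf is hypothetical. It assumes that a usable PDF has a
"%PDF-" header near the start and a "%%EOF" marker near the end, which
is what the "May not be a PDF file" and "Couldn't find trailer
dictionary" errors suggest is missing.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Rough check: PDF header near the start, EOF marker near the end.

    Hypothetical helper for illustration only. A file can pass this
    check and still be damaged in other ways.
    """
    has_header = b"%PDF-" in data[:1024]
    has_eof = b"%%EOF" in data[-1024:]
    return has_header and has_eof
```

If the saved file fails this check and starts with something like
"<html", the server may be returning an HTML page (an error or login
page, for example) under the .pdf URL rather than the document itself.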

> http://dev.website.org/files/Joomla Quick Start 1.0.pdf - Using HTML2
> parser -  (no words indexed)
> http://dev.website.org/index.php?option=com_content&task=view&id=4&Itemid=9
> - Using HTML2 parser -  (33 words)
> 
> If I remove the use_cookies => 1, line from my spider.conf, it works
> fine and I return to having the issue of the PHPSESSIDs. 

My guess is that with cookies you are indexing different files -- or
your site has some kind of problem.
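
For the PHPSESSID side of the problem, the usual idea is to drop the
session parameter from each URL so that two links differing only in
session ID collapse to the same key. The following is a minimal sketch
of that idea in Python, using only the standard library. It is not
spider.pl configuration (that is Perl), and the function name
strip_session_id is hypothetical; the same logic would have to be
expressed in the spider's own config.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_session_id(url: str, param: str = "PHPSESSID") -> str:
    """Return url with the given query parameter removed.

    Hypothetical helper for illustration. Other parameters keep their
    order, and blank values such as "Itemid=" are preserved.
    """
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Note that this only removes the session ID. URLs that differ in a real
parameter, such as "Itemid=" versus "Itemid=9" in the output above, are
still treated as different documents even if the page content is the
same.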

-- 
Bill Moseley
moseley@hank.org

Received on Tue Dec 6 13:57:35 2005