Re: duplicate entries in DB after regex performed on URLs?

From: Chad Day <CDay(at)not-real.mindshare.net>
Date: Wed Dec 07 2005 - 14:44:22 GMT

I definitely checked.  I've run and re-run the spider, changing only the
use_cookies line, and it either works (indexes the PDF fine) or breaks
(as below) depending on whether that line is present.

I've tried adding another PDF, even though I know the original is fine,
and it also works or breaks depending on that same line.

I don't know what to make of this.

Thanks,
Chad

-----Original Message-----
From: swish-e@sunsite3.berkeley.edu [mailto:swish-e@sunsite3.berkeley.edu] On Behalf Of Bill Moseley
Sent: Tuesday, December 06, 2005 4:58 PM
To: Multiple recipients of list
Subject: [SWISH-E] Re: duplicate entries in DB after regex performed on URLs?

On Tue, Dec 06, 2005 at 04:30:41PM -0500, Chad Day wrote:
> http://dev.website.org/index.php?option=content&task=view&id=5 - Using HTML2 parser -  (39 words)
>
> http://dev.website.org/index.php?option=com_content&task=view&id=4&Itemid= - Using HTML2 parser -  (33 words)
> Error: May not be a PDF file (continuing anyway)
> Error (0): PDF file is damaged - attempting to reconstruct xref table...
> Error: Couldn't find trailer dictionary
> Error: Couldn't read xref table

That just looks like a broken pdf.  Did you check?
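
One rough way to check is to save the bytes the spider actually received and look for the basic PDF markers. The sketch below is only an illustration: the function name pdf_sanity and the sample byte strings are made up for this example, and the checks are a quick heuristic rather than a real PDF parser. A missing "%PDF-" header roughly corresponds to the "May not be a PDF file" message, and a missing "startxref"/"%%EOF" tail to the xref and trailer errors.

```python
def pdf_sanity(data: bytes) -> dict:
    """Rough structural checks on bytes that are supposed to be a PDF.

    Heuristic only: it looks for the usual markers, it does not parse the file.
    """
    head = data[:1024]    # the header is expected near the start of the file
    tail = data[-2048:]   # the xref pointer and EOF marker sit near the end
    return {
        "has_header": b"%PDF-" in head,
        "has_startxref": b"startxref" in tail,
        "has_eof": b"%%EOF" in tail,
        # An HTML body (login page, error page) is a common impostor.
        "looks_like_html": head.lstrip().lower().startswith((b"<!doctype", b"<html")),
    }


# Hypothetical sample payloads, not taken from the site in this thread.
sample_pdf = b"%PDF-1.4\n1 0 obj\n<<>>\nendobj\ntrailer\n<<>>\nstartxref\n9\n%%EOF\n"
sample_html = b"<html><body>Please log in</body></html>"

print(pdf_sanity(sample_pdf))
print(pdf_sanity(sample_html))
```

If the saved body fails the header check and looks like HTML, the file on disk may be fine and the server may simply be returning something else for that request.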

> http://dev.website.org/files/Joomla Quick Start 1.0.pdf - Using HTML2 parser -  (no words indexed)
>
> http://dev.website.org/index.php?option=com_content&task=view&id=4&Itemid=9 - Using HTML2 parser -  (33 words)
> 
> If I remove the use_cookies => 1, line from my spider.conf, it works
> fine and I return to having the issue of the PHPSESSIDs. 

My guess is that with cookies you are indexing different files -- or
your site has some kind of problem.

-- 
Bill Moseley
moseley@hank.org

Received on Wed Dec 7 06:44:42 2005