On Jul 11, 2008, at 6:43 AM, Bill Moseley wrote:
> On Fri, Jul 11, 2008 at 01:11:56AM -0700, Jo Rhett wrote:
>> (query string?)
>>
>> So while debugging a different problem I looked at my httpd logs and
>> realized something I'd apparently missed before. The swish-e spider
>> is looping over the same files dozens and dozens of times, each time
>> with different query arguments. Because all of the links on the site
>> contain a query_string containing the page they came from and a unique
>> id for the visitor (and a dynamic toolbar has links to every page),
>> this means that each page is indexed N-1 times, where N is the number
>> of pages on the site.
>
> Why don't you use cookies for session management? Your setup kind of
> makes it hard for browsers to do any caching.
It does. If the browser submits a cookie, the site uses it. If the
browser doesn't submit a cookie, the site adds query strings to the
links to track the visitor. Since the spider ignores cookies, it gets
the query strings added.
>> Is there an option to tell the swish spider to ignore the query string
>> when considering URLs? I realize that this would be inappropriate
>> for many sites, but it is essential for this site, so an option would
>> be very useful.
>
> Quick search of the archives turns up this:
>
> http://swish-e.org/archive/2004-08/8106.html
I missed that, as it contains none of the terms I was searching for.
The problem is that it isn't clear what he's talking about. Is it meant
as a modification to spider.pl? This is on a shared host, and only one
customer has this problem.
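
To make the request concrete, here is a minimal sketch in Python of what
"ignore the query string when considering URLs" would mean. It is not
spider.pl code and not an existing swish-e option; the function names
strip_query and unique_pages and the example.com URLs are made up for
illustration only.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Return url with its query string and fragment removed."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def unique_pages(urls):
    """Keep one entry per page, treating URLs that differ only in
    their query string as the same page. Order of first sight is kept."""
    seen = set()
    kept = []
    for url in urls:
        key = strip_query(url)
        if key not in seen:
            seen.add(key)
            kept.append(key)
    return kept


if __name__ == "__main__":
    # Hypothetical links as the spider would see them on this site.
    links = [
        "http://example.com/a.html?from=/b.html&id=123",
        "http://example.com/a.html?from=/c.html&id=123",
        "http://example.com/b.html?from=/a.html&id=123",
    ]
    for page in unique_pages(links):
        print(page)
```

With a rule like this, each page would be fetched once no matter how
many different referrer and visitor-id combinations link to it.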
--
Jo Rhett
Net Consonance : consonant endings by net philanthropy, open source
and other randomness
Received on Fri Jul 11 12:40:13 2008