Re: Not Saving Correct Title to the Index

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Aug 03 2006 - 01:45:59 GMT
On Wed, Aug 02, 2006 at 01:17:36PM -0700, Ken Schweigert wrote:
> I'm having trouble getting Swish-e to write the correct title to the  
> index.  I use the "swishspider" to index the site because it is a  
> dynamic site and uses mod_rewrite.

swishspider might be a problem.

> 1000 http://www.cedarhomes.com/cedar_homes/featured_articles_1//? 
> expand=44 "?expand=44" 11555

(Isn't it amazing that modern email clients can't paste without
wrapping?)

Hmm, this doesn't make much sense:


    moseley@bumby:~/ken$ swishspider ken http://www.cedarhomes.com/cedar_homes/featured_articles_1

    moseley@bumby:~/ken$ head ken.contents 
    Cedar Homes :: Cedar Homes :: FEATURED ARTICLES<br><!-- START TEMPLATE: portal/menu.php -->
    <ul id="p7PMnav">
        <li><a href="/site_map.php" title="Cedar Homes Site Map">SITE MAP</a></li>
        <li><a href="/design_center/" title="Design Center">DESIGN CENTER</a></li>
        <li><a href="http://www.cedarhomes.com/cedar_homes/contact_us_1" class="p7PMtrg">CONTACT US</a>
        <ul>
        <li><a href="http://www.cedarhomes.com/cedar_homes/contact_us_1/contact_our_design_staff">CONTACT OUR DESIGN STAFF</a> 

    </li><li><a href="http://www.cedarhomes.com/cedar_homes/contact_us_1/about_our_design_managers">ABOUT OUR DESIGN MANAGERS</a>



Try the new spider, spider.pl. Its output looks better:

    moseley@bumby:~/ken$ /usr/local/lib/swish-e/spider.pl default http://www.cedarhomes.com/cedar_homes/featured_articles_1 | head
    /usr/local/lib/swish-e/spider.pl: Reading parameters from 'default'
    Path-Name: http://www.cedarhomes.com/cedar_homes/featured_articles_1
    Content-Length: 22973
    Document-Type: html*

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <html>
    <head>
    <title>Cedar Homes :: Cedar Homes :: FEATURED ARTICLES</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

That is weird, because both programs use Perl's LWP module to fetch
the remote document.

Looks like the swishspider program isn't fetching the document
correctly.  But if I try it on other sites it looks fine:

    moseley@bumby:~/ken$ swishspider ken http://slashdot.org/
    moseley@bumby:~/ken$ head ken.contents 
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
                "http://www.w3.org/TR/html4/strict.dtd">
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

    <title>Slashdot: News for nerds, stuff that matters</title>

    <link rel="stylesheet" type="text/css" media="screen, projection" href="//images.slashdot.org/base.css?T_2_5_0_120">
    <link rel="stylesheet" type="text/css" media="screen, projection" href="//images.slashdot.org/ostgnavbar.css?T_2_5_0_120">


That's really odd.  Your site has HTML validation errors, but nothing
I can see that would confuse the fetch.  And like I said, swishspider,
GET (the LWP command-line tool), and spider.pl all use the same Perl
module to fetch that page.
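
To compare what each tool saved, a quick check for whether a fetched
copy still carries its <title> element can help.  This is only a rough
sketch in Python (not Perl, and not part of Swish-e); the names
TitleFinder and extract_title are made up for illustration, and the
two sample strings are cut down from the outputs shown above.

```python
from html.parser import HTMLParser


class TitleFinder(HTMLParser):
    """Collects the text of the first <title> element, if there is one."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None  # stays None when no <title> tag is seen

    def handle_starttag(self, tag, attrs):
        # Only the first <title> counts; later ones are ignored.
        if tag == "title" and self.title is None:
            self.in_title = True
            self.title = ""

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract_title(html):
    """Hypothetical helper: return the title text, or None if no <title> tag."""
    finder = TitleFinder()
    finder.feed(html)
    finder.close()
    return finder.title.strip() if finder.title is not None else None


# Sample inputs trimmed from the two outputs in this thread.
from_spider_pl = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\n'
    "<html>\n<head>\n"
    "<title>Cedar Homes :: Cedar Homes :: FEATURED ARTICLES</title>\n"
)
from_swishspider = (
    "Cedar Homes :: Cedar Homes :: FEATURED ARTICLES<br>"
    "<!-- START TEMPLATE: portal/menu.php -->\n"
    '<ul id="p7PMnav">\n'
)

print(extract_title(from_spider_pl))    # Cedar Homes :: Cedar Homes :: FEATURED ARTICLES
print(extract_title(from_swishspider))  # None
```

The second sample has the title text but no <title> tag around it,
which matches what the head of ken.contents shows for the Cedar Homes
page.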

Anyway, any reason you are not using spider.pl to spider your pages?

-- 
Bill Moseley
moseley@hank.org

Received on Wed Aug 2 18:46:04 2006