
Re: Where's my descriptions and titles?

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Jan 24 2002 - 14:27:36 GMT
At 05:39 AM 01/24/02 -0800, Rich Thomas wrote:
>I've included my config file and a sample of results.

Thanks!

>Why do I get Null
>titles and no descriptions? 

Titles work, so I'd need to see your CGI script.  But you don't have a
<body> tag, and the basic HTML parser is not smart enough to fix your
broken HTML.

> head rich.html 
<title> E/E/F/8/403 University at Buffalo Libraries Web Catalog</title> <br>
<h3>
United States.  Bureau of Land Management.</h3>  <br>
<h3>
BLM Wyoming fishing opportunities /  United States Department of the
Interior, Bureau of Land Management.</h3>  <br>
<h3>
Wyoming fishing opportunities</h3>  <br>
<h3>
Title within map border:  Fishing opportunities [place] Wyoming</h3>  <br>
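
For what it's worth, a fixed-up version of that page (just a rough sketch,
not your exact content) would wrap the records in real <html>/<body> tags so
the basic parser has something to grab:

<html>
<head>
<title>E/E/F/8/403 University at Buffalo Libraries Web Catalog</title>
</head>
<body>
<h3>United States.  Bureau of Land Management.</h3>
<h3>BLM Wyoming fishing opportunities / United States Department of the
Interior, Bureau of Land Management.</h3>
</body>
</html>

With markup like that, StoreDescription HTML <body> 5000 would find the
description without any help from libxml2.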


> cat rich.conf
StoreDescription HTML <body> 5000

> ./swish-e -c rich.conf -i rich.html -v 0
Indexing Data Source: "File-System"
Indexing done!

> ./swish-e -w horn -p swishdescription -H0
1000 rich.html "E/E/F/8/403 University at Buffalo Libraries Web Catalog"
1630 ""
.

See, we are getting the title but not the description, because there's no
<body> tag.  Now let's get some help from libxml2, since it will attempt to
fix your HTML:

> cat rich.conf
StoreDescription HTML2 <body> 5000
DefaultContents HTML2

> ./swish-e -c rich.conf -i rich.html -v 0
Indexing Data Source: "File-System"
Indexing done!

> ./swish-e -w horn -p swishdescription -H0
1000 rich.html "E/E/F/8/403 University at Buffalo Libraries Web Catalog"
1630 "United States. Bureau of Land Management. BLM Wyoming fishing
opportunities / 
..

There, libxml2 came to the rescue.

>How do I force swish-e not to follow all links when using the http method?
>Is this even possible?

Use robots.txt, the standard robots exclusion protocol.
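
For example, a minimal robots.txt at your document root might look like this
(the paths are made up; put in whatever you don't want spidered):

User-agent: *
Disallow: /cgi-bin/
Disallow: /catalog/private/

Keep in mind that applies to any well-behaved robot, not just swish-e.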

Too bad FileRules doesn't work on URLs.

I believe I've dropped hints all over the 2.1 docs that I'm not a fan of
the -S http method.

If you use -S prog with the spider.pl program, you have full control over
spidering.  You can use robots.txt, per-document <META> robots exclusion
tags, Perl regular expressions, or anything else you can imagine to control
what is spidered.
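
For example, a stripped-down spider config might look something like this
(the key names are from memory, so check the spider.pl documentation for
the exact options; the URL and email are placeholders):

# SwishSpiderConfig.pl -- read by spider.pl
@servers = (
    {
        base_url => 'http://localhost/index.html',  # where spidering starts
        email    => 'you@example.com',              # identifies you to the server
        # test_url gets a URI object; return false to skip the link entirely.
        test_url => sub {
            my $uri = shift;
            return 0 if $uri->path =~ m{^/cgi-bin/};          # skip CGI scripts
            return 0 if $uri->path =~ /\.(gif|jpe?g|png)$/i;  # skip images
            return 1;
        },
    },
);
1;   # the config file must return a true value

Then index with something like ./swish-e -c rich.conf -S prog -i ./spider.pl
(spider.pl looks for SwishSpiderConfig.pl in the current directory by
default, if memory serves).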


-- 
Bill Moseley
mailto:moseley@hank.org
Received on Thu Jan 24 14:28:07 2002