On Fri, Jun 10, 2005 at 06:56:30AM -0700, koszalekopalek wrote:
> Bill Moseley wrote:
> > On Thu, Jun 09, 2005 at 03:05:59PM -0700, koszalekopalek wrote:
> > And even with billions of pages it turns out easy to hack. I think
> > the search for "miserable failure" is a common example.
> :-) I wasn't familiar with that one. But it looks
> just like that 'moron' example I mentioned.
> > BTW, on the swish-e site, the indexer never sees a link for "docs"
> > when indexing the main page. Notice all the <!-- noindex --> tags?
> Which program filters it out -- swish or spider.pl?
Kind of both in this case. First, there are two indexes that are
searched. They are not merged, so that may affect ranking a bit
(Peter, you are more awake than I am, so you might need to help with
that. ;) The list archive is indexed with the hypermail script that's
part of the distribution. That extracts out data from messages for
The site has its pages indexed in chunks so search results will not
just take you to the page, but to the section of the page. That
likely affects the ranking you are hoping for since the entire page is
not indexed. You can see how it's split here into sections:
Then, once passed to swish, any <!-- noindex --> tags are also observed
while swish is parsing.
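
To illustrate what observing those tags amounts to, here is a minimal
Python sketch. It is not swish's or spider.pl's actual code. It assumes
an ignored region starts at <!-- noindex --> and runs to a closing
<!-- index --> comment (or to the end of the input if it is never
closed); the name strip_noindex is made up for the example.

```python
import re

# Assumption: a region to skip starts at <!-- noindex --> and ends at
# <!-- index -->, or at end of input when no closing comment is found.
NOINDEX_RE = re.compile(
    r"<!--\s*noindex\s*-->.*?(?:<!--\s*index\s*-->|\Z)",
    re.IGNORECASE | re.DOTALL,
)


def strip_noindex(html: str) -> str:
    """Return html with every noindex region removed (hypothetical helper)."""
    return NOINDEX_RE.sub("", html)


# A page where the "docs" link sits inside a noindex region.
page = (
    "<p>Welcome</p>"
    '<!-- noindex --><a href="/docs">docs</a><!-- index -->'
    "<p>Body text</p>"
)
cleaned = strip_noindex(page)
```

With the region removed, the text handed on for indexing no longer
contains the "docs" link, which is the effect described above for the
main page.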
> So far I have used spider.pl (i.e. the callback routines)
> to filter HTML but maybe there is some configuration
> directive for swish that I missed?
That I cannot answer. ;)
Received on Fri Jun 10 07:20:56 2005