On Tuesday 15 April 2003 03:54 pm, Bill Moseley wrote:
> On Tue, 15 Apr 2003, Douglas Smith wrote:
> > Well, what I was thinking of is a property which contains numbers.
> > Call it something like 'rankfactor' or 'swishrankfactor' which can
> > be put in a meta tag in the html head. Then at search time, the rank
> is computed, and then this factor can be applied, simply multiplied
> into the computed rank.
> That's a lot of disk accesses. So once swish gets its list of results it
> would need to walk the list and read the property file for each
> result. As it is now, only the results that are displayed have their
> properties accessed.
> A "better" way might be to build a big table in memory that can be
> accessed fast. That's done when IgnoreTotalWordsPerFile (or whatever it's
> called) is set to false -- an extra table is built and that's loaded into
> memory. A "more sensible" way would be to use a fast hash table on disk, I
> suppose.
> > Is this a big design change? (Maybe, perhaps I should just go
> > through the code myself before asking for too much...)
> Sure, you are welcome to enter the scary world of swish-e code. It's ugly
> in there, but slowly getting better.
Well, I can't just let this go; I think there needs to
be a way to tell the swish-e index that certain pages are more
important than others. I want to figure out how to do this.
I went to a talk on Friday at Stanford given by a couple of guys from
Google about information retrieval systems, index design and user
interfaces. The talks were fairly hand-wavy, without much detail,
but some interesting points were brought up about the ranking of web
pages. They talked about the main idea Google's systems need to
handle: which pages are authoritative and believed to be good
sources of information? And how do you do this when people are
out there trying to game your system and change the rankings to
their advantage?
But they did talk about the 'pagerank' system, which looks at the
number of links to a page, and also at the links to pages which link
to the original page. So, if a lot of people link to a page, it
doesn't much matter unless a lot of people link to them also.
And the system goes back quite a few levels of links to determine
the rank of a page.
But this system assumes a few things: that there are authoritative
pages, and enough links between pages that you can determine which
pages are the authorities. This is true on the internet, with
a billion pages to search and measure links across, but it is less
true in a limited setting, like an organization with up to a few 100k
pages to index. Especially if in a limited organization there was never
anyone set up to create authoritative pages, and the pages were just
created through the years in a fairly random manner.
And it is especially less true if you are trying to index something
which is inherently more structured, like a messaging system with
threaded topics, or a catalog of books, or anything other than a web
of interlinked pages.
I think this is where swish-e is best put to use: for medium
sized collections (a few 1000 to a few 100k 'pages' of information),
where people need to tailor the search index to what people need to
retrieve. But you can't develop a system like 'pagerank' to
best handle all structures; for certain structures it will not rank
pages very well.
So, there is a need in a medium sized indexing system for the
user to specify to the index which pages are more important,
so they get ranked higher.
Of course Google would not supply this, since people all over
the internet would just up their rank factor to something crazy,
but in a controlled organization this isn't a problem.
I hope this isn't just noise on this mailing list... I should
start going after the code and see whether there could be some way to
put this in. I mean you have to get info about the page out
of the index anyway, not just the properties, at least the
url or filename. When this is retrieved, couldn't a rank factor
be retrieved also?
Douglas A. Smith email@example.com
Office: Bld 280, Rm 157 (650)926-2369
Received on Mon Apr 21 21:00:17 2003