dcha099@ec.auckland.ac.nz wrote:
> I am trying to use Swish-e to replace regular database querying. Since there are
> millions of records, there might be a problem with having millions of files on
> certain filesystems. Is there an easy solution for this?
I believe the standard practice is to set up a script that generates
"HTML" pages on the fly, without writing them to the filesystem. These
are then fed to the spider program (I haven't needed to touch the spider
code at all).
You end up with something like this in the conf (the indexing
configuration) file:
IndexDir spider.pl ./NewsDB2.pl
where NewsDB2.pl connects to the database, transforms each record into
a pseudo-HTML page, and writes the pages to STDOUT. The spider picks
them up one at a time and the indexing takes place.
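
To make the output format concrete, here is a minimal sketch of what such a generator script writes for each record. It is in Python rather than Perl purely for illustration, and the function names format_doc and emit_all are hypothetical. It assumes the layout described in the "prog" section of the docs: a Path-Name header, a Content-Length header giving the size of the content in bytes, a blank line, then the content itself.

```python
import sys


def format_doc(path_name: str, html: str) -> bytes:
    """Wrap one pseudo-HTML page in the header block read by the prog method."""
    body = html.encode("utf-8")
    # Content-Length is the byte count of the body, not the character count.
    header = f"Path-Name: {path_name}\nContent-Length: {len(body)}\n\n"
    return header.encode("utf-8") + body


def emit_all(records, out=None):
    """Write every (path_name, html) pair to out, one document after another."""
    if out is None:
        out = sys.stdout.buffer  # binary stream, so byte counts stay exact
    for path_name, html in records:
        out.write(format_doc(path_name, html))
```

A real script would build the (path_name, html) pairs by looping over a database cursor and then call emit_all on them. Nothing is written to the filesystem at any point.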
If you go to:
http://www.swish-e.org/current/docs/SWISH-RUN.html
and look at the section "prog - general purpose access method", there is
an example script (in Perl) showing the key bits of the process.
The key is that the Path-Name header of each document needs to be set so
that it leads back to a web-addressable representation of that record in
the database, e.g.
http://www.mydatabase.org/records.pl?ID=
where you finish the value with the unique $id.
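
As a rough illustration of building that value, the sketch below appends the record id to the example URL above. The function name record_path_name is hypothetical, and the ID parameter name is simply taken from the example URL; adjust both to whatever your own record viewer expects.

```python
from urllib.parse import urlencode

# Example URL from this message; replace with your own record viewer.
BASE_URL = "http://www.mydatabase.org/records.pl"


def record_path_name(record_id, base_url=BASE_URL):
    """Return the URL to use as Path-Name for one database record."""
    # urlencode escapes characters that are not safe in a query string.
    return f"{base_url}?{urlencode({'ID': record_id})}"
```

For a numeric id such as 1234 this gives http://www.mydatabase.org/records.pl?ID=1234, which is what a search result would then link to.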
The latest version of the docs explains the relevance ranking logic.
Note that if you map a particular field or fields from the database
into a <title>, it will be weighted more heavily. Similarly, you may
choose to export two (or more) copies of specific fields as a crude way
of weighting the relevance. Even better methods are anticipated.
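
A small sketch of both tricks, assuming each record has one field worth using as the title plus some fields you want to count for more. The function record_to_html and its parameters are hypothetical names, and how much the repetition actually shifts the ranking is something to check against the docs and your own data.

```python
from html import escape


def record_to_html(title, body, boosted=(), copies=2):
    """Build a pseudo-HTML page for one record.

    title   -- field mapped into <title>
    body    -- main text, emitted once
    boosted -- extra field values, each emitted `copies` times
    """
    parts = ["<html><head><title>", escape(title), "</title></head><body>"]
    parts.append(f"<p>{escape(body)}</p>")
    for text in boosted:
        # Crude weighting: repeat the field so its words occur more often.
        for _ in range(copies):
            parts.append(f"<p>{escape(text)}</p>")
    parts.append("</body></html>")
    return "".join(parts)
```

The escape calls keep stray &, < and > characters in the database text from confusing the HTML parser.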
Others with more experience in this are welcome to chip in.
Walter Lewis
Halton Hills
Received on Wed Jan 12 07:24:52 2005