On Sun, 2006-07-16 at 06:56 -0700, Bill Moseley wrote:
> > I am using swish to search a large repository of files that are in
> > html,pdf & doc format and serve the search results to the web clients.
> > I have a requirement to reduce the ranking of a file if it has pdf or
> > doc extension.
> Would have been fun to be in that meeting. Everyone knows it's not
> the content that's important but the container. Much of the U.S.
> consumer economy is based on that.
It is not just a usability problem; there is also a technical problem
that leads to irrelevant search results. Point #2 below describes the
problem of incorrect ranking.
1. I have documentation that is duplicated in html and pdf. Also, not
all documentation is duplicated, and it is difficult to remove the
duplication since the repository is large and I am not the producer of
2. A PDF is a monolithic book, whereas the same content in html is
distributed over many pages in the form of chapters. Because of this, a
single PDF has a higher frequency of occurrence of search words than any
single html page. Thus a whole bunch of PDFs invariably appear at the
top of the search results (even the not so relevant ones).
3. Even if a search is able to find both html and PDF, I would prefer
html to come early since it is browseable. Web based experienced is not
> > From swish documentation or the discussion archives, I couldn't find
> > any details along these lines.
> > Can the rank order be customized based on file extension, if so how
> > can this be done.
> You are right. Swish doesn't give you a way to implement such a
> thing. You would need to either modify the source code or sort your
> results first by file extension (ExtractPath) then by rank and show
> your results in groups.
For the short term, sorting the results by extension and then by rank
seems straightforward and quick to do. I will do this and check how good
the results remain. I will look into the source if this is not
satisfactory.
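
Here is a rough sketch of the short-term approach, assuming the search
results have already been parsed into (path, rank) pairs by whatever
front end serves them. The names EXTENSION_PRIORITY and
sort_by_extension_then_rank, the priority values, and the sample paths
are all made up for illustration; none of this is part of swish.

```python
import os

# Hypothetical preference order: a lower number sorts earlier.
# html pages come first, pdf and doc files after them.
EXTENSION_PRIORITY = {".html": 0, ".htm": 0, ".pdf": 1, ".doc": 1}
DEFAULT_PRIORITY = 2  # any other extension sorts last


def sort_by_extension_then_rank(results):
    """Order (path, rank) pairs by extension group, then by rank descending."""
    def sort_key(item):
        path, rank = item
        ext = os.path.splitext(path)[1].lower()
        # Negate the rank so that higher ranks come first within a group.
        return (EXTENSION_PRIORITY.get(ext, DEFAULT_PRIORITY), -rank)

    return sorted(results, key=sort_key)


# Made-up sample results: the monolithic PDF has the highest raw rank.
results = [
    ("guide/book.pdf", 1000),
    ("guide/ch1.html", 640),
    ("notes.doc", 700),
    ("guide/ch2.html", 820),
]

for path, rank in sort_by_extension_then_rank(results):
    print(rank, path)
```

With this ordering the html chapters are listed before the PDF even
though the PDF has the highest raw rank. The cost is that a weak html
hit will also be listed above a strong PDF hit.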
I also found that providing different categories to search from
(multiple index files for different documentation), selected using check
boxes, reduced the number of irrelevant results (though the original
problem remains). But this
approach does not bring out the quality of the extraordinary tool that
Received on Wed Jul 19 19:06:31 2006