What you are describing is a situation I have run into before. It's not a Swish-e
problem; it's a document management problem.
The solution is actually quite simple: only index one format of each document,
then offer the user the opportunity to read it in alternate formats.
Since you don't manage the document collection, that can be harder to achieve:
the collection may not be organized in a way that makes it easy to tell when you
are looking at multiple formats of the same document. Here's one solution
someone found:
http://swish-e.org/archive/2005-02/8998.html
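For what it's worth, here is a rough sketch of the "index only one format" idea. It assumes the alternate formats of a document sit at the same path and differ only in extension, which may well not hold for your repository. The function name and the preference order are made up for illustration; nothing here is part of Swish-e.

```python
import os
from collections import defaultdict

# Hypothetical preference order: the first format found gets indexed,
# the others are only offered to the user as alternates.
PREFERRED = [".html", ".htm", ".pdf", ".doc"]

def pick_one_format(paths, preferred=PREFERRED):
    """Group paths by name-without-extension.

    Returns {path_to_index: [alternate_paths]}.
    """
    groups = defaultdict(list)
    for p in paths:
        stem, ext = os.path.splitext(p)
        groups[stem].append((ext.lower(), p))

    def rank(item):
        ext = item[0]
        # Unknown extensions sort after every preferred one.
        return preferred.index(ext) if ext in preferred else len(preferred)

    chosen = {}
    for stem, items in groups.items():
        items.sort(key=rank)  # stable sort: ties keep their input order
        chosen[items[0][1]] = [p for _, p in items[1:]]
    return chosen
```

The keys of the returned dict are what you would hand to the indexer; the values are what you would show as "also available as" links next to each hit.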
As for a single html document being split across several files, I had the same
situation. My solution was to create a virtual composite of those html files
(order was irrelevant) and feed it to swish-e via -S prog as a single document.
That way the PDF vs HTML issue was moot.
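A minimal sketch of that composite approach, not the script I actually used. It relies on the -S prog convention that the program prints a few headers (Path-Name, Content-Length, and optionally Document-Type), then a blank line, then exactly Content-Length bytes of content. The function name and the crude concatenation of whole files are assumptions for illustration, and you should check the Document-Type value against your own config.

```python
import sys

def emit_composite(path_name, html_files, out=None):
    """Write several HTML files to `out` as one document in -S prog format.

    path_name is the (virtual) name swish-e will report for the composite.
    Returns the number of content bytes written.
    """
    if out is None:
        out = sys.stdout.buffer  # swish-e reads the program's stdout

    parts = []
    for name in html_files:
        with open(name, "rb") as fh:
            parts.append(fh.read())
    body = b"\n".join(parts)  # order does not matter for indexing

    # Content-Length must be the byte count of the body, not a character count.
    header = (
        "Path-Name: %s\n"
        "Content-Length: %d\n"
        "Document-Type: HTML\n"
        "\n" % (path_name, len(body))
    ).encode("utf-8")
    out.write(header + body)
    return len(body)
```

You would call this once per logical document from a script and run something like `swish-e -S prog -i ./composite.py -c swish.conf`, where composite.py and swish.conf are placeholder names.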
good luck,
pek
Shivakumar GN scribbled on 7/19/06 9:05 PM:
> On Sun, 2006-07-16 at 06:56 -0700, Bill Moseley wrote:
>
>>> I am using swish to search a large repository of files that are in
>>> html,pdf & doc format and serve the search results to the web clients.
>>> I have a requirement to reduce the ranking of a file if it has pdf or
>>> doc extension.
>> Would have been fun to be in that meeting. Everyone knows it's not
>> the content that's important but the container. Much of the U.S.
>> consumer economy is based on that.
>>
>
> It is not just a usability problem, there is a technical problem as well
> leading to irrelevant search results. Point #2 below describes the
> problem of incorrect ranking.
>
> 1. I have documentation that is duplicated in html and pdf. Not all of
> it is duplicated, though, and it is difficult to remove the
> duplication since the repository is large and I am not the producer of
> it.
> 2. PDF is a monolithic book, whereas the same content in html is
> distributed over many pages in the form of chapters. Because of this a
> single PDF has a higher frequency of occurrence of search words than
> html. Thus a whole bunch of PDFs invariably appear at the top of the
> search results (even the not so relevant ones).
> 3. Even if a search is able to find both html and PDF I would prefer
> html to come first since it is browsable. That way the web-based
> experience is not lost.
>
>
>>> From swish documentation or the discussion archives, I couldn't find
>>> any details along these lines.
>>> Can the rank order be customized based on file extension, if so how
>>> can this be done.
>> You are right. Swish doesn't give you a way to implement such a
>> thing. You would need to either modify the source code or sort your
>> results first by file extension (ExtractPath) then by rank and show
>> your results in groups.
>>
>
> For the short term, sorting the results by extension and then rank seems
> straightforward and quick to do. Will do this and check out how good the
> results remain. Will look into the source if this is not satisfactory.
>
> I also found that providing different categories (multiple index files
> for different documentation) to search from using check boxes reduced
> the amount of irrelevance (though original problem remains). But this
> approach does not bring out the quality of the extra-ordinary tool that
> swish is.
>
> thanks.
> Shiv
>
>
>
--
Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
Received on Wed Jul 19 20:40:46 2006