On Sun, Dec 14, 2003 at 11:15:02AM -0800, Frances Coakley wrote:
>
> > > There is NO WAY to store more than one encoding in the index as it is
> > > currently designed.
>
> Doesn't the meta charset give you the encoding used in the original document?
> Assuming that the 8-bit chars are the more unusual chars, it is possible
> that a word in the Icelandic charset maps onto the same sequence of 8-bit
> chars as a different word in the Norse charset. But if the searcher is
> viewing with the Icelandic charset set, then searching for Meta
> Charset=Icelandic and word=whatever will find the Icelandic word. Those
> pages not encoded under the Icelandic charset cannot by definition contain
> that char.
> Or have I misunderstood the problem?
Yes, that would work because you are not mixing encodings in the
same index. John's suggestion was to index "as-is" which would mix
encodings.
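To make the mixing problem concrete, here is a minimal Python sketch. It is
not Swish-e code; the sample word and the choice of ISO-8859-1 versus
ISO-8859-5 are assumptions picked only for illustration. It shows one 8-bit
byte sequence decoding to two different words, which is why an "as-is"
index cannot tell them apart, while the UTF-8 forms stay distinct.

```python
# One raw 8-bit byte sequence, as it might sit in an index built "as-is".
raw = bytes([0xFE, 0xF3, 0x72])

# The same bytes read under two different 8-bit charsets.
as_latin1 = raw.decode("iso-8859-1")    # 'þór' under ISO-8859-1
as_cyrillic = raw.decode("iso-8859-5")  # a different, Cyrillic string

# Both words collapse to identical bytes in their own 8-bit encodings...
same_bytes = (as_latin1.encode("iso-8859-1") == as_cyrillic.encode("iso-8859-5"))

# ...but remain distinguishable once both are stored as UTF-8.
distinct_in_utf8 = (as_latin1.encode("utf-8") != as_cyrillic.encode("utf-8"))

print(same_bytes, distinct_in_utf8)
```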
Since metanames are subsets of documents (with the exception of some of
the true metadata like dates or pathnames), you would need a complete
duplicate set of metanames for each encoding found. It would probably be
easier to design a system that selects an index file based on character
encoding. But that's still limited to 8-bit character sets, so UTF-8
is where the effort should go.
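A rough sketch of the "one index file per encoding" idea, in Python. The
index file names and the index_for() helper are hypothetical and nothing
like this exists in swish today; the only real API used is Python's
codecs.lookup(), which normalizes charset aliases such as "latin1" and
"ISO-8859-1" to one canonical name.

```python
import codecs

def canonical(charset):
    """Normalize a charset label so that aliases compare equal."""
    return codecs.lookup(charset).name

# Hypothetical mapping: one separately built index file per 8-bit charset.
INDEX_FILES = {
    canonical(name): path
    for name, path in [
        ("iso-8859-1", "index.latin1"),
        ("iso-8859-5", "index.cyrillic"),
    ]
}

def index_for(charset):
    """Return the index file for this charset, or None if there is none."""
    try:
        return INDEX_FILES[canonical(charset)]
    except LookupError:  # unknown charset name, or no index built for it
        return None

print(index_for("latin1"))
```

Each index still holds only one 8-bit encoding, so this avoids mixing but
does nothing for documents that need more than 256 characters.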
Some features in swish are based on characters being 8-bit. I think the
wildcard feature (foo*) uses a 256-entry lookup table, but I can't remember
for sure.
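To show why a 256-entry table ties a feature to 8-bit characters, here is a
small Python sketch. It is an assumption about the general technique, not a
copy of how swish actually does wildcards: words are bucketed by their first
byte, so the table index can only ever be 0..255.

```python
TABLE_SIZE = 256  # one slot per possible 8-bit character value

def build_table(words):
    """Bucket non-empty byte-string words by their first byte."""
    table = [[] for _ in range(TABLE_SIZE)]
    for word in words:
        table[word[0]].append(word)  # word[0] is an int in 0..255
    return table

def prefix_search(table, prefix):
    """Emulate a foo* search: check only the bucket of the first byte."""
    return sorted(w for w in table[prefix[0]] if w.startswith(prefix))

words = [w.encode("iso-8859-1") for w in ["foo", "food", "bar", "fig"]]
table = build_table(words)
matches = prefix_search(table, b"foo")
print(matches)

# A character outside the 8-bit range has no slot in such a table.
fits_in_table = ord("\u4e2d") < TABLE_SIZE
print(fits_in_table)
```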
--
Bill Moseley
moseley@hank.org
Received on Sun Dec 14 21:24:53 2003