RE: JW24a (i18n sort ordering) - Unicode 3.0 from Martin J. Duerst on 2000-05-09 (www-webdav-dasl@w3.org from April to June 2000)

From: Martin J. Duerst <duerst@w3.org>
Date: Tue, 09 May 2000 20:20:26 +0900
To: Jim Whitehead <ejw@ics.uci.edu>, infonuovo@email.com, www-webdav-dasl@w3.org
Message-Id: <4.2.0.58.J.20000509201553.0302d6b0@sh.w3.mag.keio.ac.jp>

At 00/05/03 16:55 -0700, Jim Whitehead wrote:

>Dennis Hamilton writes:
> > The Unicode 3.0 specification does address sort orders more.
>
>This is good to hear.
>
> > I am not that confident in implementers as you!  (I am sitting here in
> > Italy watching multi-language issues show up left and right as I
> > research some connectivity problems using the Internet and various
> > European -- the problem should be understood here, yes? -- customer
> > support numbers, etc.)
>
>OK, so I've been thinking about this problem a little more, and it still
>seems really hard.
>
>Let's look at one problem:
>
>Assume you have a set of resources, a third of which are in English, a third
>of which are in Italian, and the final third are in Japanese.  The titles of
>these are stored as WebDAV properties, in the native language and character
>set.

There is no such thing as a 'native character set'. 'Customary legacy
character encoding' is probably what you meant. The actual encoding
should be irrelevant to the sort order.
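
A minimal sketch of that point, assuming hypothetical titles and two
arbitrarily chosen encodings (UTF-8 and UTF-16LE): once the stored bytes
are decoded to characters, the sort order no longer depends on how they
were stored, whereas sorting the raw bytes does. Plain code point order
stands in for a real collation here.

```python
# Hypothetical titles; "\u0100da\u017ei" is the place name "Ādaži".
titles = ["Roma", "\u0100da\u017ei", "Z\u00fcrich"]


def sort_decoded(encoded, encoding):
    """Decode to characters first, then sort by code point."""
    return sorted(raw.decode(encoding) for raw in encoded)


def sort_raw_bytes(encoded, encoding):
    """Sort the stored bytes directly, then decode for display."""
    return [raw.decode(encoding) for raw in sorted(encoded)]


as_utf8 = [t.encode("utf-8") for t in titles]
as_utf16 = [t.encode("utf-16-le") for t in titles]

# Same characters, same order, whatever the stored encoding was.
decoded_utf8 = sort_decoded(as_utf8, "utf-8")
decoded_utf16 = sort_decoded(as_utf16, "utf-16-le")

# Sorting bytes instead makes the result depend on the encoding.
raw_utf8 = sort_raw_bytes(as_utf8, "utf-8")
raw_utf16 = sort_raw_bytes(as_utf16, "utf-16-le")

print(decoded_utf8 == decoded_utf16)
print(raw_utf8 == raw_utf16)
```
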


>Now assume you do a DASL query to retrieve the titles of some subset,
>with the results returned in ascending order.  Further assume that the
>result set contains some English, German, and Japanese resources.

Why suddenly German instead of Italian?


>What would be a reasonable response for a DASL server?  One answer is to
>just return them grouped by language, returning first the Italian ones, then
>all the English ones, then all the Japanese ones.

Well, if the user wants it that way, then the server should do that.
But wanting it that way is about the same as wanting first all
HTML pages, then all GIFs, then JPEGs, ...



>But, even here this isn't
>sufficient.  What if several of the resources are about the same subject?
>In this case, the user would prefer to have them ordered by subject (for
>example, all resources about Rome should be grouped together, whether Rome
>is spelled "Rome" or "Roma").

Well, sometimes the spelling helps, but sometimes not. Rome is written
in Katakana in Japanese.


>So, it might make sense to sort all languages
>expressible using ISO-Latin-1 characters (I think these correspond to the
>first 256 characters in Unicode) together, but then list all other resources
>by language.

No, it makes sense to sort by characters, independent of language,
because users look for characters in a sorted list, not for language.
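
As a rough sketch of sorting by characters rather than by language,
assume a hypothetical list of (language tag, title) pairs. The key
function below is only a crude stand-in for a real collation such as the
Unicode Collation Algorithm: it drops combining marks and case, then
compares code points. The language tag is carried along but never
consulted.

```python
import unicodedata


def rough_key(title: str) -> str:
    """Crude sort key: strip combining marks, ignore case."""
    decomposed = unicodedata.normalize("NFD", title)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()


# Hypothetical resources: (language tag, title).
resources = [
    ("en", "Rome"),
    ("ja", "ローマ"),
    ("it", "Roma"),
    ("it", "Città del Vaticano"),
    ("en", "Venice"),
    ("it", "Venezia"),
]

# The language tag plays no part in the ordering.
by_character = sorted(resources, key=lambda r: rough_key(r[1]))

for lang, title in by_character:
    print(lang, title)
```

With this key the Italian and English titles interleave, and the
Katakana title falls after the Latin ones simply because of where its
characters sit, not because of its language.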


>However, I suspect there are probably other language groups that share the
>same alphabet, but I have no idea what they are.

Oh well, very very much so. Japanese kana are used to write Ainu,
Hebrew is used to write Yiddish and Ladino, Arabic and Cyrillic are
used for a bunch of languages, and so on.
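
A small illustration, assuming the first word of a character's Unicode
name is an acceptable rough guess at its script (the Python standard
library does not expose the real Script property, so this is a
heuristic, and the sample words are only examples): text in different
languages comes out with the same script, so the characters alone do
not identify the language.

```python
import unicodedata


def rough_scripts(text: str) -> set:
    """Guess scripts from the first word of each letter's Unicode name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    }


# Sample words (assumed examples): Hebrew, Yiddish, Russian, Serbian.
samples = {
    "he": "עברית",
    "yi": "יידיש",
    "ru": "Москва",
    "sr": "Београд",
}

for lang, word in samples.items():
    print(lang, sorted(rough_scripts(word)))
```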


Regards,   Martin.

Received on Tuesday, 9 May 2000 07:43:03 UTC