Re: A character is in the eye of the beholder
Martin J Duerst writes:
> Keld Simonsen wrote:
> >Martin Bryan writes:
> >> This doesn't work either as some languages require accented characters to be
> >> placed at the end of the list. CEN TC304 are working on a set of sorting
> >> rules for ISO 10646, which i18n should adopt as soon as ready for European
> >> languages, but the sorting problems of CJK will need to be met by other
> >> means as the same glyph can mean different things in different
> >> contexts/languages.
> >Which languages require accented characters placed at the end of the list?
> I guess Martin meant languages like Danish and Swedish, which put some
> accented characters (which they might not call accented characters)
> at the end of their alphabet.
OK, I know of these requirements (being Danish myself!).
These are not considered accented characters in Danish, Swedish,
Norwegian, Finnish, etc.; they are considered genuine letters.
For Danish it is ÆØÅ, placed after Z in the ordering.
As for ordering, it is recognized that ordering is
a cultural convention that may differ from language/culture to
language/culture. So the ISO ordering standard (CD 14651) is
designed to be easily tailorable to accommodate languages like
Swedish and Danish.
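
To illustrate what such a tailoring amounts to, here is a minimal
sketch in Python. It is not the CD 14651 table format, only a
single-level sort key built from an assumed Danish alphabet string;
the names DANISH_ALPHABET and danish_key are made up for this example,
and it ignores finer points such as case tie-breaking or digraphs.

```python
# Hypothetical single-level tailoring: Æ, Ø, Å rank after Z.
DANISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {ch: i for i, ch in enumerate(DANISH_ALPHABET)}


def danish_key(word):
    """Map each letter to its rank; unknown characters sort last."""
    return [RANK.get(ch, len(RANK)) for ch in word.lower()]


words = ["Århus", "Zebra", "Æble", "Ørsted", "Abe"]

# Plain code point order puts Å (U+00C5) before Æ (U+00C6) and Ø (U+00D8).
print(sorted(words))

# The tailored key gives the Danish order: ... Z, Æ, Ø, Å.
print(sorted(words, key=danish_key))
```

The point is only that the tailoring is a small table laid over a
common default, which is what makes a tailorable standard practical.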
Well, does ordering have any influence on the WWW?
I think it may do so, for example when a server sends over a
sorted list of files in a directory. It may also influence indexes
which could be of some importance in the chaotic world of the web.
So I think the web should support culturally dependent ordering.
> >For CJK there are a number of ways to sort 10646, and WG20 will specify
> >one. There may be more specified by national standardization bodies.
> >Will this not be adequate for a number of purposes?
> >SC2/WG2 will have sorting information available for all CJK characters.
> Sorting ideographs as such, e.g. by some of their graphical properties,
> is something that you may do if you don't have any other information.
> And it's fairly easy, as the ideographs are already sorted that way
> currently in ISO 10646, with two exceptions: (1) The ordering is based on
> some traditional dictionaries; things that many people nowadays would
> sort differently are not considered. (2) With the addition of ideographic
> supplement(s), the interleaving of two or more collections has to be
Yes, we can sort 10646 CJK characters quite easily when we only
look at the information available with these characters (when we do not
have more information on, for example, the pronunciation).
There are several ways of sorting CJK characters. One is to use the
binary order of 10646; others are to use other established schemes.
The CJK additions to 10646 will have a proposed ordering included
to merge with the current 10646 ordering.
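
As a small illustration of the binary-order option, the following
Python sketch sorts three ideographs purely by their 10646 code
positions. The sample characters are my own choice; Python compares
strings code point by code point, so no extra key is needed.

```python
# Sort ideographs by their 10646 code positions (binary order).
chars = ["水", "月", "日"]

by_code_point = sorted(chars)

# Show each character with its code position.
for c in by_code_point:
    print(f"U+{ord(c):04X} {c}")
```

This needs no information beyond the characters themselves, which is
exactly why it is available when nothing else is known.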
> However, for many if not most purposes, it is customary to sort
> ideographs phonetically. Because, as Martin has mentioned, pronunciation
> of an ideograph depends on language and context, and the different
> languages have different phonetic sorting orders, it's impossible
> to say that ideograph A comes before ideograph B in all cases.
> What you need e.g. for correct sorting in an index, is to
> annotate the words and expressions you want to sort with phonetic
> information, and to use this phonetic information for sorting.
Yes, that is also my understanding.
But given that you do not have pronunciation data available for
a CJK string, I would say that the specifications of ISO CD 14651
are adequate for ordering them.
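
For the case where pronunciation data is available, the annotation
approach Martin describes can be sketched roughly as follows in
Python. The entry layout and the name sort_by_reading are assumptions
for illustration. The readings are a Japanese kana reading and a
toneless Mandarin pinyin reading for each character, and comparing
kana or pinyin by code point is only an approximation of the real
phonetic orders.

```python
# Each entry carries its own phonetic annotation per language.
entries = [
    {"text": "月", "ja": "つき", "zh": "yue"},
    {"text": "日", "ja": "ひ", "zh": "ri"},
    {"text": "水", "ja": "みず", "zh": "shui"},
]


def sort_by_reading(items, lang):
    """Order entries by the annotated reading for the given language."""
    return [e["text"] for e in sorted(items, key=lambda e: e[lang])]


# The same ideographs come out in a different order per language.
print(sort_by_reading(entries, "ja"))
print(sort_by_reading(entries, "zh"))
```

The two calls give different orders for the same three characters,
which is why no single ordering of ideograph A before ideograph B can
hold in all cases.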