So ... what characters for CLASS?
Prof J Larmouth, University Director of Telematic Applications,
IT Institute, University of Salford, Salford M5 4WT, England.
J.Larmouth @ ITI.SALFORD.AC.UK Telephone: +44 161 745 5657
Fax: +44 161 745 8169
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Subject: So ... what (human-level) "characters" for CLASS?
The following remarks are based on my understanding of the discussion so
far. (I am not here addressing the sub-thread of sorting, only
It seems that we do need to allow "characters" in the CLASS attribute
that are formed using 10646 combining characters, in order to meet the
needs of some languages (Indic etc). I use the term "character" for the
human-intuitive concept of "character", not the formal 10646 definition.
Unless I misunderstood, someone stated that even for Western languages
there are some "characters" that cannot be represented with the current
10646 level 1. (Some Danish and Swedish examples have been given in
earlier mailings, but I am unsure whether these related to sort order
problems or to representation issues.) If there ARE Western language
characters missing, we clearly cannot reasonably prohibit their use in
the CLASS attribute (until such time as they are added to 10646
level 1).
If comparisons of 10646 encodings of strings containing such "characters"
are to produce results the users will be happy with, some specification
(it could be - part of - an RFC, it could be an ISO Standard) needs to
a) A "character" which has a level 1 representation in some
version x of 10646 (such as A grave) has a canonical
representation using that level 1 10646 character, **and other
representations are only recognised if earlier versions of
10646 did not include that "character" at level 1**.
b) For "version x" of 10646, where there is no level 1
encoding of a particular "character", we need explicit
recognition in some (not necessarily 10646) specification of a
set of human-level "characters" beyond those present in level 1,
with a defined "canonical" order of 10646 characters that
represent those human-level "characters". It is possible/likely
that the definition of the human-level "characters" (and the
"canonical" order of 10646 representation) will be in some cases
ad hoc (the Danish and Swedish "characters"?) and in some cases
algorithmic (Indic and Thai). But a clear specification of such
"characters" and their canonical representation in version x of
10646 is certainly needed. It would be possible for this
specification to appear in the HTML definition of the CLASS
c) In subsequent "version x+1 onwards" specifications of
10646, where a level 1 encoding may have been introduced for
these "characters" (likely for any missing Danish/Swedish
characters, unlikely for Indic/Thai), the "additional"
specification will relate the new level 1 encoding to the earlier
10646 sequence, and engines conforming to "version x+1 onwards"
will recognise the equivalence. (Of course, earlier engines
will not understand the new level 1 code-point - that is
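
As a rough illustration of points a) and b), the sketch below compares
CLASS values after reducing them to one canonical form. It assumes
Unicode's NFC normalization form as a stand-in for the specification
called for above; NFC is a later, separately standardized mechanism and
is not something this message defines. The helper names canonical_class
and class_equal are hypothetical.

```python
import unicodedata


def canonical_class(value: str) -> str:
    """Return a canonical form of a CLASS value (hypothetical helper).

    Assumption: NFC stands in for the "canonical representation" the
    message asks for. NFC prefers a single precomposed code point where
    one exists (point a) and puts any remaining combining marks into a
    fixed canonical order (point b).
    """
    return unicodedata.normalize("NFC", value)


def class_equal(a: str, b: str) -> bool:
    """Compare two CLASS values by their canonical forms."""
    return canonical_class(a) == canonical_class(b)


# Point a): "A grave" as one code point versus A + combining grave.
precomposed = "\u00C0"
combining = "A\u0300"

# Point b): the same two combining marks typed in different orders
# (dot above then dot below, versus dot below then dot above).
marks_one_order = "a\u0307\u0323"
marks_other_order = "a\u0323\u0307"

print(precomposed == combining)                          # raw comparison
print(class_equal(precomposed, combining))               # canonical comparison
print(class_equal(marks_one_order, marks_other_order))   # canonical comparison
```

On point c), Python's unicodedata module is tied to one specific version
of the character database (reported by unicodedata.unidata_version), so
an engine built this way only recognises the equivalences known to the
version it was built against.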
Are we anywhere near closure on this discussion of what is the
appropriate character set and encoding for CLASS, and how we do
comparisons? Perhaps someone else could offer a summary of where the
discussion has got to?