- From: <J.Larmouth@iti.salford.ac.uk>
- Date: 29 Oct 96 8:57
- To: www-international@w3.org
=========================================================================
Prof J Larmouth, University Director of Telematic Applications,
IT Institute, University of Salford, Salford M5 4WT, England.
J.Larmouth @ ITI.SALFORD.AC.UK
Telephone: +44 161 745 5657  Fax: +44 161 745 8169
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Subject: So ... what (human-level) "characters" for CLASS?

The following remarks are based on my understanding of the discussion so far. (I am not here addressing the sub-thread of sorting, only comparison).

It seems that we do need to allow "characters" in the CLASS attribute that are formed using 10646 combining characters, in order to meet the needs of some languages (Indic etc). I use the term "character" for the human-intuitive concept of "character", not the formal 10646 definition.

Unless I misunderstood, someone stated that even for Western languages, there are some "characters" that cannot be represented with the current 10646 level 1 (some Danish and Swedish examples have been given in earlier mailings, but I am unsure whether these were with respect to sort order problems or to representation issues), but if there ARE Western language characters missing, we clearly cannot reasonably prohibit their use in the CLASS attribute (until such time as they are added to 10646 level 1).

If comparisons of 10646 encodings of strings containing such "characters" are to produce results the users will be happy with, some specification (it could be - part of - an RFC, it could be an ISO Standard) needs to specify that:

a) A "character" which has a level 1 representation in some version x of 10646 (such as A grave) has a canonical representation using that level 1 10646 character, **and other representations are only recognised if earlier versions of 10646 did not include that "character" at level 1**.

b) For "version x" of 10646, where there is no level 1 encoding of a particular "character", we need explicit recognition in some (not necessarily 10646) specification of a set of human-level "characters" beyond those present in level 1, with a defined "canonical" order of 10646 characters that represent those human-level "characters". It is possible/likely that the definition of the human-level "characters" (and the "canonical" order of 10646 representation) will be in some cases ad hoc (the Danish and Swedish "characters"?) and in some cases algorithmic (Indic and Thai). But a clear specification of such "characters" and their canonical representation in version x of 10646 is certainly needed. It would be possible for this specification to appear in the HTML definition of the CLASS attribute.

c) In subsequent "version x+1 onwards" specifications of 10646, where a level 1 encoding may have been introduced for these "characters" (likely for any missing Danish/Swedish characters, unlikely for Indic/Thai), the "additional" specification will relate the new level 1 encoding to the earlier 10646 sequence, and engines conforming to "version x+1 onwards" will recognise the equivalence. (Of course, earlier engines will not understand the new level 1 code-point - that is unavoidable.)

Are we anywhere near closure on this discussion of what is the appropriate character set and encoding for CLASS, and how we do comparisons? Perhaps someone else could offer a summary of where the discussion has got to?

John L
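
A minimal sketch of the kind of comparison that points (a) to (c) call for. It assumes Python's standard unicodedata module and uses its NFC normalisation as a stand-in for the "canonical representation"; that choice is an assumption made for illustration, not the specification proposed above, and the function names canonical and class_equal are hypothetical.

```python
import unicodedata


def canonical(s: str) -> str:
    # Assumption: NFC stands in for the "canonical representation".
    # It prefers a single precomposed code point where one exists and
    # puts any remaining combining marks into a fixed order.
    return unicodedata.normalize("NFC", s)


def class_equal(a: str, b: str) -> bool:
    # Compare two CLASS values by their canonical forms rather than
    # by their raw code-point sequences.
    return canonical(a) == canonical(b)


# Point (a): A grave has a single-code-point form, so the combining
# sequence is treated as equivalent to it.
precomposed = "\u00C0"   # LATIN CAPITAL LETTER A WITH GRAVE
combining = "A\u0300"    # A followed by COMBINING GRAVE ACCENT

# Point (b): no single code point exists for q with dot below and
# dot above, so only the order of the combining marks is canonicalised.
marks_one_order = "q\u0323\u0307"
marks_other_order = "q\u0307\u0323"

print(precomposed == combining)                        # raw comparison
print(class_equal(precomposed, combining))             # canonical comparison
print(class_equal(marks_one_order, marks_other_order))
```

A raw code-point comparison treats the two spellings of A grave as different strings, while the canonical comparison treats them as the same "character". Point (c) corresponds to the normalisation tables growing over time: an engine built against a later version would map a newly added single code point and the older sequence to the same canonical form.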
Received on Tuesday, 29 October 1996 04:14:51 UTC