So ... what characters for CLASS?

=========================================================================
     Prof J Larmouth,  University Director of Telematic Applications,
     IT Institute,  University of Salford,  Salford M5 4WT,  England.

J.Larmouth @ ITI.SALFORD.AC.UK                Telephone: +44 161 745 5657
                                                    Fax: +44 161 745 8169
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

To:     www-international@w3.org

Subject:      So ... what (human-level) "characters" for CLASS?

The following remarks are based on my understanding of the discussion so
far.  (I am not here addressing the sub-thread of sorting,  only
comparison).

It seems that we do need to allow "characters" in the CLASS attribute
that are formed using 10646 combining characters,  in order to meet the
needs of some languages (Indic etc).  I use the term "character" for the
human-intuitive concept of "character",  not the formal 10646 definition.

Unless I misunderstood, someone stated that even for Western languages
there are some "characters" that cannot be represented with the current
10646 level 1.  (Some Danish and Swedish examples have been given in
earlier mailings,  but I am unsure whether these concerned sort order
problems or representation issues.)  If there ARE Western language
characters missing,  we clearly cannot reasonably prohibit their use in
the CLASS attribute (until such time as they are added to 10646
level 1).

If comparisons of 10646 encodings of strings containing such "characters"
are to produce results the users will be happy with,  some specification
(it could be - part of - an RFC,  it could be an ISO Standard) needs to
specify that:

        a)      A "character" which has a level 1 representation in some
        version x of 10646 (such as A grave) has a canonical
        representation using that level 1 10646 character,  **and other
        representations are only recognised if earlier versions of
        10646 did not include that "character" at level 1**.

        b)      For "version x" of 10646,  where there is no level 1
        encoding of a particular "character",  we need explicit
        recognition in some (not necessarily 10646) specification of a
        set of human-level "characters" beyond those present in level 1,
        with a defined "canonical" order of 10646 characters that
        represent those human-level "characters".  It is possible/likely
        that the definition of the human-level "characters" (and the
        "canonical" order of 10646 representation) will be in some cases
        ad hoc (the Danish and Swedish "characters"?) and in some cases
        algorithmic (Indic and Thai).  But a clear specification of such
        "characters" and their canonical representation in version x of
        10646 is certainly needed.  It would be possible for this
        specification to appear in the HTML definition of the CLASS
        attribute.

        c)      In subsequent "version x+1 onwards" specifications of
        10646,  where a level 1 encoding may have been introduced for
        these "characters" (likely for any missing Danish/Swedish
        characters,  unlikely for Indic/Thai),  the "additional"
        specification will relate the new level 1 encoding to the earlier
        10646 sequence,  and engines conforming to "version x+1 onwards"
        will recognise the equivalence.  (Of course,  earlier engines
        will not understand the new level 1 code-point - that is
        unavoidable.)
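
To make (a) to (c) concrete,  here is a rough sketch of a comparison that
canonicalises both strings before comparing them.  It is illustrative
only:  it assumes Python's standard unicodedata module and uses Unicode
Normalization Form C as a stand-in for the "canonical representation"
discussed above,  which is an assumption of this sketch and not something
the proposal itself prescribes.  The helper names canonical and
class_names_equal are hypothetical.

```python
import unicodedata


def canonical(name: str) -> str:
    # Assumption: NFC plays the role of the "canonical representation".
    # It prefers a single precomposed code point where one exists and
    # otherwise puts combining marks into a defined canonical order.
    return unicodedata.normalize("NFC", name)


def class_names_equal(a: str, b: str) -> bool:
    # Compare CLASS values only after canonicalising both sides.
    return canonical(a) == canonical(b)


# Case (a): A grave as one code point versus base letter + combining mark.
precomposed = "\u00C0"   # LATIN CAPITAL LETTER A WITH GRAVE
combining = "A\u0300"    # "A" followed by COMBINING GRAVE ACCENT

print(precomposed == combining)                   # False
print(class_names_equal(precomposed, combining))  # True

# Case (b): no single code point is assumed to exist for this "character",
# so equality depends on a defined order for the combining marks.
tilde_then_dot = "q\u0303\u0323"   # q + COMBINING TILDE + COMBINING DOT BELOW
dot_then_tilde = "q\u0323\u0303"   # q + COMBINING DOT BELOW + COMBINING TILDE

print(tilde_then_dot == dot_then_tilde)                   # False
print(class_names_equal(tilde_then_dot, dot_then_tilde))  # True
```

A raw code-point comparison treats each pair as different,  while the
canonicalised comparison treats them as the same "character",  which is
the behaviour users would presumably expect of CLASS matching.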

Are we anywhere near closure on this discussion of what is the
appropriate character set and encoding for CLASS,  and how we do
comparisons?  Perhaps someone else could offer a summary of where the
discussion has got to?

John L

Received on Tuesday, 29 October 1996 04:14:51 UTC