Internationalized CLASS attributes

(To avoid duplicate e-mails, please reply only to
www-international@w3.org)


The next version of HTML will have a CLASS attribute on (nearly) all
elements, as described in several documents ([1], [2], [3], [4]). The
intention is to allow authors to attach semantic information to
elements, in the form of keywords:

    <p class=abstract>...
      <em class=surname>...

The keywords can also be picked up by a style sheet to display the
element in a special way.

However, there is a problem: a conflict between case-insensitivity and
allowing non-ASCII characters. We'd like to be able to say that the
above example is exactly the same as

    <P CLASS=ABSTRACT>...
      <EM CLASS=SURNAME>...

This works well if the class is in ASCII. But there is a problem if
the class is a French word, since the French normally omit accents
from uppercase:

    café -> CAFE or CAFÉ ?

I expect the French would be surprised when CLASS=TELEPHONE is not the
same as class=téléphone. But in other languages dropping the accents
might well change the meaning of the word.
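
To make the mismatch concrete, here is a small sketch in Python. It is
only an illustration and not part of any of the drafts: it assumes
Python 3's built-in, locale-independent str.upper() and the standard
unicodedata module, and the helper name strip_accents is made up.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks after canonical decomposition (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The default uppercase mapping keeps the accents...
default_upper = "téléphone".upper()

# ...whereas the habit of omitting accents on capitals gives another string.
french_style_upper = strip_accents(default_upper)

print(default_upper)                        # TÉLÉPHONE
print(french_style_upper)                   # TELEPHONE
print(default_upper == french_style_upper)  # False
```

So a comparison that simply uppercases both sides would treat
CLASS=TELEPHONE and class=téléphone as different keywords.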

German shows an example of that. It also shows a different problem (at
least for implementers...), since a word may become longer in the
uppercase version:

    weiß -> WEISS    maße -> MASSE
    WEISS -> weiss   MASSE -> masse

The second column shows a word that changes meaning (maße = measures,
dimensions; masse = mass).
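
The same two words can be run through a round trip in Python. This is
a sketch only, assuming Python 3's str.upper() and str.lower(), which
apply language-independent mappings; the function name round_trip is
made up for the example.

```python
def round_trip(word):
    """Return the uppercase form and the result of lowercasing it again."""
    upper = word.upper()
    return upper, upper.lower()


for word in ("weiß", "maße"):
    upper, back = round_trip(word)
    # The uppercase form is one character longer, and lowercasing it
    # does not bring back the original spelling.
    print(word, "->", upper, "->", back,
          "| length", len(word), "->", len(upper),
          "| round trip ok:", back == word)
```

An implementation that reserves a buffer of the same length for the
uppercase form, or that assumes lowercase(uppercase(x)) = x, would get
both words wrong.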

In Turkish, a language-independent conversion maps both the i and the
dotless-i to uppercase I (Turkish itself writes the uppercase of i
with a dot):

    ilik -> ILIK
    *l*k -> ILIK  (* = dotless-i)

But the words mean different things (marrow and cool, respectively).
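
A quick check of the collision, again as a Python sketch. It assumes
Python 3's locale-independent str.upper(); the variable names are only
for illustration, and the dotless-i is written as an escape (U+0131)
so that the example survives ASCII-only mail.

```python
DOTLESS_I = "\u0131"  # LATIN SMALL LETTER DOTLESS I

marrow = "ilik"
cool = DOTLESS_I + "l" + DOTLESS_I + "k"

# Two different lowercase words...
print(marrow == cool)                  # False

# ...that become indistinguishable after uppercasing.
print(marrow.upper(), cool.upper())    # ILIK ILIK
print(marrow.upper() == cool.upper())  # True
```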

The Unicode standard has a table of case-conversions, and it claims to
be good enough for all languages. Nevertheless, it will cause
surprises in all three languages mentioned above.

So it appears that case-conversions are language dependent. That's
why, for example, there is setlocale() in POSIX. But is it practical
to make the case rules for CLASS dependent on the language? Where
would you get the language from?
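
To show what language-dependent rules would involve, here is a
minimal sketch in Python. Everything in it is an assumption made for
illustration: the table TAILORED_UPPER, the function name
upper_for_language, and the idea that a language code is somehow
available for each CLASS value. Only Turkish is filled in; the
fallback is Python 3's language-independent str.upper().

```python
# Hypothetical per-language exceptions, consulted before the default mapping.
TAILORED_UPPER = {
    "tr": {
        "i": "\u0130",  # i -> dotted capital I
        "\u0131": "I",  # dotless i -> plain capital I
    },
}


def upper_for_language(text, lang):
    """Uppercase text, applying the exceptions registered for lang, if any."""
    exceptions = TAILORED_UPPER.get(lang, {})
    return "".join(exceptions.get(ch, ch.upper()) for ch in text)


dotted = "ilik"
dotless = "\u0131l\u0131k"

# With the Turkish exceptions the two words stay distinct.
print(upper_for_language(dotted, "tr") == upper_for_language(dotless, "tr"))  # False

# Without them (any language not in the table) they collide.
print(upper_for_language(dotted, "en") == upper_for_language(dotless, "en"))  # True
```

The code is short, but note that the result of a comparison now
depends on the lang argument, which is exactly the piece of
information that is hard to pin down for a CLASS attribute.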

Or do we change the interpretation of CLASS, and say that it is just a
code (class=xyz12, class=p-89x), that doesn't have to be
human-readable? In that case ASCII is all we need.
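
For comparison, the "just a code" interpretation could look like this
Python sketch. The pattern below (an ASCII letter followed by ASCII
letters, digits or hyphens) is my own guess at what such a code might
allow, not something taken from the drafts, and the names
CODE_PATTERN, is_code and same_class are made up.

```python
import re

# Hypothetical syntax for a CLASS code: ASCII only.
CODE_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9-]*")


def is_code(name):
    """True if name consists only of the ASCII characters allowed above."""
    return CODE_PATTERN.fullmatch(name) is not None


def same_class(a, b):
    """Case-insensitive comparison, defined only for ASCII codes."""
    if not (is_code(a) and is_code(b)):
        raise ValueError("CLASS values must be ASCII codes")
    return a.lower() == b.lower()


print(same_class("abstract", "ABSTRACT"))  # True
print(is_code("p-89x"))                    # True
print(is_code("café"))                     # False
```

Within ASCII the case mapping is one-to-one and the same in every
language, so none of the surprises above can occur; the price is that
café and téléphone are no longer valid keywords.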

What do people think?

Note that declaring CLASS as NAME or NAMES instead of CDATA (as in [1]
and [2]) solves only part of the problem. It does make the case
conversion well-defined, but also makes it language-independent.


Bert


[1] ftp://ietf.org/internet-drafts/draft-ietf-html-i18n-05.txt
[2] ftp://ds.internic.net/rfc/rfc1942.txt
[3] http://www.w3.org/pub/WWW/TR/WD-style
[4] http://www.w3.org/pub/WWW/MarkUp/Cougar/HTML.dtd

-- 
  Bert Bos                                ( W 3 C ) http://www.w3.org/
  http://www.w3.org/pub/WWW/People/Bos/                      INRIA/W3C
  bert@w3.org                             2004 Rt des Lucioles / BP 93
  +33 93 65 77 71                 06902 Sophia Antipolis Cedex, France

Received on Wednesday, 16 October 1996 15:22:15 UTC