- From: Bert Bos <bbos@mygale.inria.fr>
- Date: Wed, 16 Oct 1996 21:22:06 +0200 (MET DST)
- To: www-international@w3.org, www-html@w3.org
(To avoid duplicate e-mails, please reply only to www-international@w3.org)

The next version of HTML will have a CLASS attribute on (nearly) all elements, as described in several documents ([1], [2], [3], [4]). The intention is to allow authors to attach semantic information to elements, in the form of keywords:

    <p class=abstract>...
    <em class=surname>...

The keywords can also be picked up by a style sheet to display the element in a special way.

However, there is a problem: a conflict between case-insensitivity and allowing non-ASCII characters. We'd like to be able to say that the above example is exactly the same as

    <P CLASS=ABSTRACT>...
    <EM CLASS=SURNAME>...

This works well if the class is in ASCII. But there is a problem if the class is a French word, since the French normally omit accents from uppercase:

    café -> CAFE or CAFÉ ?

I expect the French would be surprised when CLASS=TELEPHONE is not the same as class=téléphone. But in other languages this might well change the meaning of the word.

German shows an example of that. It also shows a different problem (at least for implementers...), since a word may become longer in the uppercase version:

    weiß  -> WEISS      maße  -> MASSE
    WEISS -> weiss      MASSE -> masse

The second column shows a word that changes meaning (maße = measures, dimensions; masse = mass).

In Turkish, both the i and the dotless-i are mapped to uppercase I:

    ilik -> ILIK
    *l*k -> ILIK      (* = dotless-i)

But the words mean different things (resp. marrow and cool).

The Unicode standard has a table of case-conversions, and it claims to be good enough for all languages. Nevertheless, it will cause surprises in all three languages mentioned above.

So it appears that case-conversions are language dependent. That's why, for example, there is setlocale() in POSIX. But is it practical to make the case rules for CLASS dependent on the language? Where would you get the language from?
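
As an illustration of the mappings above, here is a minimal Python sketch (not taken from any of the cited drafts). It assumes Python's built-in str.upper() and str.lower(), which apply the default, locale-independent Unicode case mappings. The helper name same_class is hypothetical and only stands in for a case-insensitive CLASS comparison.

```python
def same_class(a: str, b: str) -> bool:
    """Hypothetical case-insensitive CLASS comparison: compare uppercase forms."""
    return a.upper() == b.upper()


# The Turkish word written with dotless-i (U+0131) in place of each "i".
DOTLESS_ILIK = "\u0131l\u0131k"

# Uppercasing can lengthen a word, and the round trip loses the original spelling.
print("weiß".upper())          # WEISS
print("maße".upper().lower())  # masse

# Both Turkish words collapse to the same uppercase string.
print(same_class("ilik", DOTLESS_ILIK))      # True

# The accent survives uppercasing, so the unaccented form does not match.
print(same_class("téléphone", "TELEPHONE"))  # False
```

With these default mappings the German and Turkish pairs are treated as equal although they are different words, while the French pair is treated as different although a French reader might expect a match.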

Or do we change the interpretation of CLASS, and say that it is just a code (class=xyz12, class=p-89x), that doesn't have to be human-readable? In that case ASCII is all we need.

What do people think?

Note that declaring CLASS as NAME or NAMES instead of CDATA (as in [1] and [2]) solves only part of the problem. It does make the case conversion well-defined, but also makes it language-independent.

Bert

[1] ftp://ietf.org/internet-drafts/draft-ietf-html-i18n-05.txt
[2] ftp://ds.internic.net/rfc/rfc1942.txt
[3] http://www.w3.org/pub/WWW/TR/WD-style
[4] http://www.w3.org/pub/WWW/MarkUp/Cougar/HTML.dtd

-- 
Bert Bos ( W 3 C ) http://www.w3.org/
http://www.w3.org/pub/WWW/People/Bos/
INRIA/W3C
bert@w3.org
2004 Rt des Lucioles / BP 93
+33 93 65 77 71
06902 Sophia Antipolis Cedex, France
Received on Wednesday, 16 October 1996 15:22:15 UTC