Re: Internationalized CLASS attributes
Jonathan Rosenne wrote:
>Bert Bos wrote:
>> The next version of HTML will have a CLASS attribute on (nearly) all
>> elements, as described in several documents. The
>> intention is to allow authors to attach semantic information to
>> elements, in the form of keywords:
>> <p class=abstract>...
>> <em class=surname>...
>> The keywords can also be picked up by a style sheet to display the
>> element in a special way.
>> However, there is a problem: a conflict between case-insensitivity and
>> allowing non-ASCII characters. We'd like to be able to say that the
>> above example is exactly the same as
>> <P CLASS=ABSTRACT>...
>> <EM CLASS=SURNAME>...
>I used to write COBOL, but then I began to C...
>I don't believe there is added value in case-insensitivity this day and
>age. Are there any of those terminals that always display upper case
>still around? Those with the a->A switch?
>I suggest that the class names should be defined as case sensitive.
I completely agree with Jonathan. There is no reason for such
Declaring things NAME and relying on SGML does not work very well,
and starting to define your own case equivalence for CDATA is too
much effort wasted for too little benefit.
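To illustrate why a home-grown case equivalence gets awkward once class
names leave ASCII, here is a minimal Python sketch. The function names are
made up for this example, and Python's string methods merely stand in for
whatever rule a specification might choose; nothing here is proposed HTML
behaviour.

```python
# Hypothetical helpers; the names are illustrative only.

def same_class_exact(a: str, b: str) -> bool:
    """Case-sensitive comparison: the simple rule argued for here."""
    return a == b


def same_class_lower(a: str, b: str) -> bool:
    """Naive case-insensitive comparison via lowercasing."""
    return a.lower() == b.lower()


def same_class_casefold(a: str, b: str) -> bool:
    """Case-insensitive comparison via Unicode case folding."""
    return a.casefold() == b.casefold()


# ASCII names behave as expected under the naive rule.
print(same_class_lower("abstract", "ABSTRACT"))

# U+00DF (sharp s) uppercases to "SS", so lowercasing both sides
# no longer brings the two spellings together.
print(same_class_lower("stra\u00dfe", "STRASSE"))

# Full case folding does match them, but it is a further rule that
# every implementation would have to get exactly right.
print(same_class_casefold("stra\u00dfe", "STRASSE"))
```

The exact comparison needs no tables and no special cases, which is the
point of keeping class names case sensitive.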
>A friendly browser could, of course, do a case insensitive search if the
>case sensitive search fails.
NO, PLEASE! Users will have no big problems if browsers clearly refuse
to display things that don't match. Users have no problem distinguishing
upper case and lower case, if they are told to do so. But they won't
learn it if browsers don't tell them, and will get confused if different
browsers show different behaviour.
Let's try not to make the same mistakes as with other HTML syntax.
Let's try to avoid bugwards compatibility.
>ASCII only names are too limiting. People should be able to name things
>in their own language.
When we designed the i18n extensions for HTML, we decided to not
extend the character set for tags beyond ASCII. I think this was
okay because it affected only the limited set of existing tags.
For class names, which can be anything, this restriction is
definitely less justified.
>But there is another problem with internationalized names: UCS defines a
>non-unique coding. Some composite characters have at least two valid
>representations, the composed character and the base character followed
>by diacritics. If there is more than one diacritic, their order is not
>defined. The user often has no control over the coding. So before using
>a name, it must be brought to a canonical representation.
This is definitely a problem that should be addressed. In Java, it was
"solved" the easy way, saying that different encodings are different
identifiers. Maybe this was okay for real programmers. For HTML users,
it's definitely not okay.
Still, in this case, it's rather easy because only equivalence has
to be specified. I am currently working on something else, the
internationalization of URLs, where equivalence is probably not
enough, and where a normalized encoding is desired.
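As a rough sketch of what "only equivalence has to be specified" could
mean in practice, the Python below compares two class names after bringing
both to a canonical form. It assumes the Unicode normalization forms as
implemented in Python's standard unicodedata module; that choice, and the
function name, are assumptions of this example and not something decided
in this thread.

```python
import unicodedata


def equivalent_class_names(a: str, b: str) -> bool:
    """Treat two names as the same class if they are canonically equivalent.

    Assumption: canonical equivalence is tested by comparing NFC forms.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Composed character versus base character followed by a diacritic.
composed = "caf\u00e9"       # e with acute as one code point
decomposed = "cafe\u0301"    # "e" followed by combining acute accent

print(composed == decomposed)                        # raw code points differ
print(equivalent_class_names(composed, decomposed))  # equivalent after NFC

# Two diacritics typed in different orders: dot below and acute accent.
order_1 = "a\u0323\u0301"
order_2 = "a\u0301\u0323"
print(equivalent_class_names(order_1, order_2))
```

Only the comparison uses the normalized strings; the document itself can
keep whatever encoding the user's editor happened to produce.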
There are other, related cases of equivalence. One is the full-width/
half-width issue for East Asian character sets. From a user point of
view, these can easily be distinguished, so it is not necessary to
unify them. On the other hand, one variant is clearly a compatibility
variant, so unification to get rid of compatibility is also not a
Even less of a concern are compatibility ligatures. They should rarely
if ever be used when creating HTML.
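For comparison, here is a small Python sketch of how the full-width and
ligature cases differ from the composed/decomposed case. It assumes
Python's unicodedata module, with NFC standing in for canonical
equivalence and NFKC for compatibility unification; the helper names are
hypothetical.

```python
import unicodedata


def canonical_key(name: str) -> str:
    """Canonical form only: compatibility variants stay distinct."""
    return unicodedata.normalize("NFC", name)


def compat_key(name: str) -> str:
    """Compatibility form: full-width letters and ligatures are unified."""
    return unicodedata.normalize("NFKC", name)


fullwidth = "\uff21\uff22\uff23"   # FULLWIDTH LATIN CAPITAL LETTERS A, B, C
ligature = "\ufb01le"              # LATIN SMALL LIGATURE FI followed by "le"

# Under canonical equivalence these remain different names.
print(canonical_key(fullwidth) == "ABC")
print(canonical_key(ligature) == "file")

# Under compatibility unification they collapse to the plain spellings.
print(compat_key(fullwidth) == "ABC")
print(compat_key(ligature) == "file")
```

Which of the two keys a class-name comparison should use is exactly the
open question for these compatibility characters.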