Case sensitivity in markup
OK, internationalization experts, here's your chance to help out in a
The ERB has considered the issue of case sensitivity in markup at some
length and has boiled down the options to just two alternatives:
1. Full case sensitivity in markup.
2. Case-folding using the Unicode rules.
We would like everyone on the WG who has a qualified opinion on this
issue to state which of these alternatives is preferable and why.
The following points should be borne in mind:
1. Option 1 has tremendous backward-compatibility implications.
Important as compatibility is, however, the primary purpose of this
activity is to enable standardized structured data to be generated and
served out to Web clients, not to preserve existing legacy data.
2. The Unicode method folds uppercase to lowercase (see below). ISO
8879 folds lowercase to uppercase; that is, it defines uppercase
substitution (Clause 13.4.5) but not lowercase substitution.
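
To make the difference between the alternatives concrete, here is a minimal sketch in Python. The function names are hypothetical, and Python's built-in `str.lower()` and `str.upper()` stand in for the Unicode and ISO 8879 folding rules; they follow the Unicode data shipped with the interpreter, not Unicode 2.0 or an SGML declaration, so treat the results as illustrative only.

```python
def match_case_sensitive(a: str, b: str) -> bool:
    """Alternative 1: names match only if they are identical."""
    return a == b


def match_fold_lower(a: str, b: str) -> bool:
    """Alternative 2, Unicode style: fold both names to lowercase."""
    return a.lower() == b.lower()


def match_fold_upper(a: str, b: str) -> bool:
    """ISO 8879 style, approximated: fold both names to uppercase."""
    return a.upper() == b.upper()


# ASCII names: both folding directions agree, case sensitivity does not.
print(match_case_sensitive("Title", "TITLE"))  # False
print(match_fold_lower("Title", "TITLE"))      # True
print(match_fold_upper("Title", "TITLE"))      # True

# Non-ASCII names: the direction of folding can change the answer.
print(match_fold_lower("stra\u00dfe", "STRASSE"))  # False
print(match_fold_upper("stra\u00dfe", "STRASSE"))  # True
```

The last two lines suggest why the direction of folding is not a cosmetic detail: two names can match under one direction and differ under the other.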
Because little time remains before we freeze the XML 1.0
specification, we need your input on this question as soon as possible.
From the Unicode Standard 2.0 (July 1996), Section 4.1:
_Case_ is a normative property of characters in certain alphabets
whereby characters are considered to be variants of a single
letter. These variants, which may differ markedly in shape and
size, are called the _uppercase_ letter (also known as _capital_
or _majuscule_) and the _lowercase_ letter (also known as _small_
or _minuscule_). The uppercase letter is generally larger than
the lowercase letter.
Because of the inclusion of certain composite characters for
compatibility, such as U+01F1 LATIN CAPITAL LETTER DZ, there is a
third case, called _titlecase_, which is used where the first
character of a word is to be capitalized. An example of such a
character is U+01F2 LATIN CAPITAL LETTER D WITH SMALL LETTER Z.
The three case forms are UPPERCASE, Titlecase, lowercase.
For those scripts that have case (Latin, Greek, Cyrillic, and
Armenian), the case of a Unicode character can be obtained from
the character's name. This is true for only these four scripts.
Uppercase characters contain the word _capital_ in their names.
Lowercase characters contain the word _small._ The word _small_
in the names of characters from scripts other than the four just
listed has nothing to do with case. ...
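
The name-based rule quoted above can be sketched as follows. This is a rough heuristic, not anything taken from the standard: `case_from_name` and `CASED_SCRIPTS` are hypothetical names, and Python's `unicodedata` module reflects a much later Unicode version in which more scripts have case and some character names no longer fit the rule.

```python
import unicodedata

# The four scripts the quoted passage says have case.
CASED_SCRIPTS = ("LATIN", "GREEK", "CYRILLIC", "ARMENIAN")


def case_from_name(ch: str):
    """Guess the case of a character from the words in its Unicode name.

    Returns "uppercase", "lowercase", "titlecase", or None when the
    character is unnamed or not in one of the four scripts.
    """
    words = unicodedata.name(ch, "").split()
    if not words or words[0] not in CASED_SCRIPTS:
        return None
    has_capital = "CAPITAL" in words
    has_small = "SMALL" in words
    if has_capital and has_small:
        return "titlecase"  # e.g. CAPITAL LETTER D WITH SMALL LETTER Z
    if has_capital:
        return "uppercase"
    if has_small:
        return "lowercase"
    return None


print(case_from_name("A"))       # uppercase
print(case_from_name("\u01F1"))  # uppercase (LATIN CAPITAL LETTER DZ)
print(case_from_name("\u01F2"))  # titlecase
print(case_from_name("\u3041"))  # None (HIRAGANA LETTER SMALL A)
```

The Hiragana example shows the point made in the passage: the word "small" in a name from another script says nothing about case.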
The lowercase letter default case mapping is between the small
character and the capital character. The Unicode Standard case
mapping tables, which are informative, are on the CD-ROM [bound
in with the standard].
In a few instances, upper and lowercase mapping may differ from
language to language between writing systems that employ the same
letters.... [Examples cited are Turkish dotless "i" and French
e-acute.] However, in general the vast majority of case mappings
are uniform across languages....
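
The Turkish case can be illustrated with a small sketch. `lower_turkish` is a hypothetical helper written for this message, not a library function; it assumes the usual Turkish convention that "I" lowercases to dotless U+0131 and that dotted capital U+0130 lowercases to "i". Python's `str.lower()` serves as the language-independent default.

```python
def lower_default(s: str) -> str:
    """Language-independent lowercasing, as Python's str.lower() does it."""
    return s.lower()


def lower_turkish(s: str) -> str:
    """Hypothetical Turkish-aware lowercasing (an assumption, not an API).

    U+0130 (dotted capital I) -> "i"; "I" -> U+0131 (dotless small i).
    Everything else falls through to the default mapping.
    """
    return s.replace("\u0130", "i").replace("I", "\u0131").lower()


name = "TITLE"
print(lower_default(name))  # title
print(lower_turkish(name))  # t\u0131tle, with a dotless i
print(lower_default(name) == lower_turkish(name))  # False
```

If folding were language-dependent, the same element name could fold to two different strings depending on the locale of the processor, which is presumably why a single default mapping matters for markup.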
Because there are many more lowercase forms than there are
uppercase or titlecase, it is recommended that the lowercase form
be used for normalization, such as when strings are case-folded
for loose comparison or indexing.
[Note:] Case correspondences are not always one-to-one: the
result of case folding may be a different character length than
in the source string (for example, U+00DF LATIN SMALL LETTER
SHARP S [glyph for sz shown] becomes "SS" in uppercase).
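
The note on string length is easy to check with a short sketch. The helper name `length_change_on_upper` is made up for this example, and the behaviour shown is that of Python's `str.upper()` with its bundled Unicode data, which agrees with the sharp-s example quoted above.

```python
def length_change_on_upper(s: str) -> int:
    """Return how many characters a string gains when uppercased."""
    return len(s.upper()) - len(s)


word = "stra\u00dfe"  # "strasse" spelled with U+00DF SHARP S
print(word.upper())                  # STRASSE
print(length_change_on_upper(word))  # 1

# The mapping does not round-trip: lowercasing the result does not
# restore the sharp s.
print(word.upper().lower() == word)  # False
```

An implementation that folds names in place, or that assumes a folded name fits in the buffer that held the original, would need to allow for this.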