Case sensitivity in markup from Jon Bosak on 1996-11-04 (w3c-sgml-wg@w3.org from November 1996)

From: Jon Bosak <bosak@atlantic-83.Eng.Sun.COM>
Date: Sun, 3 Nov 1996 20:14:10 -0800
To: w3c-sgml-wg@w3.org
Message-Id: <199611040414.UAA05499@boethius.eng.sun.com>
OK, internationalization experts, here's your chance to help out in a
big way.

The ERB has considered the issue of case sensitivity in markup at some
length and has boiled down the options to just two alternatives:

1. Full case sensitivity in markup.

2. Case-folding using the Unicode rules.

We would like everyone on the WG who has a qualified opinion on this
issue to state which of these alternatives is preferable and why.

The following points should be borne in mind:

1. Option 1 has tremendous backward-compatibility implications.
Important as compatibility is, however, the primary purpose of this
activity is to enable standardized structured data to be generated and
served out to Web clients, not to preserve existing legacy data
unchanged.

2. The Unicode method folds uppercase to lowercase (see below).  ISO
8879 folds lowercase to uppercase; that is, it defines uppercase
substitution (Clause 13.4.5) but not lowercase substitution.

Due to the short amount of time remaining before we freeze the XML 1.0
specification, we need your input on this question as soon as
possible.

Jon

------------------------------------------------------------------------

From the Unicode Standard 2.0 (July 1996), Section 4.1:

   _Case_ is a normative property of characters in certain alphabets
   whereby characters are considered to be variants of a single
   letter.  These variants, which may differ markedly in shape and
   size, are called the _uppercase_ letter (also known as _capital_
   or _majuscule_) and the _lowercase_ letter (also known as _small_
   or _minuscule_).  The uppercase letter is generally larger than
   the lowercase letter.

   Because of the inclusion of certain composite characters for
   compatibility, such as U+01F1 LATIN CAPITAL LETTER DZ, there is a
   third case, called _titlecase_, which is used where the first
   character of a word is to be capitalized.  An example of such a
   character is U+01F2 LATIN CAPITAL LETTER D WITH SMALL LETTER Z.
   The three case forms are UPPERCASE, Titlecase, lowercase.

   For those scripts that have case (Latin, Greek, Cyrillic, and
   Armenian), the case of a Unicode character can be obtained from
   the character's name.  This is true for only these four scripts.
   Uppercase characters contain the word _capital_ in their names.
   Lowercase characters contain the word _small._ The word _small_
   in the names of characters from scripts other than the four just
   listed has nothing to do with case. ...

   The lowercase letter default case mapping is between the small
   character and the capital character.  The Unicode Standard case
   mapping tables, which are informative, are on the CD-ROM [bound
   in with the standard].

   In a few instances, upper and lowercase mapping may differ from
   language to language between writing systems that employ the same
   letters....  [Examples cited are Turkish dotless "i" and French
   e-acute.]  However, in general the vast majority of case mappings
   are uniform across languages....

   Because there are many more lowercase forms than there are
   uppercase or titlecase, it is recommended that the lowercase form
   be used for normalization, such as when strings are case-folded
   for loose comparison or indexing.

   [Note:] Case correspondences are not always one-to-one: the
   result of case folding may be a different character length than
   in the source string (for example, U+00DF LATIN SMALL LETTER
   SHARP S [glyph for sz shown] becomes "SS" in uppercase).
Received on Sunday, 3 November 1996 23:16:10 UTC