- From: Jon Bosak <bosak@atlantic-83.Eng.Sun.COM>
- Date: Sun, 3 Nov 1996 20:14:10 -0800
- To: w3c-sgml-wg@w3.org
OK, internationalization experts, here's your chance to help out in a big way. The ERB has considered the issue of case sensitivity in markup at some length and has boiled down the options to just two alternatives: 1. Full case sensitivity in markup. 2. Case-folding using the Unicode rules. We would like everyone on the WG who has a qualified opinion on this issue to state which of these alternatives is preferable and why. The following points should be borne in mind: 1. Option 1 has tremendous backward-compatibility implications. Important as compatibility is, however, the primary purpose of this activity is to enable standardized structured data to be generated and served out to Web clients, not to preserve existing legacy data unchanged. 2. The Unicode method folds uppercase to lowercase (see below). ISO 8879 folds lowercase to uppercase; that is, it defines uppercase substitution (Clause 13.4.5) but not lowercase substitution. Due to the short amount of time remaining before we freeze the XML 1.0 specification, we need your input on this question as soon as possible. Jon ------------------------------------------------------------------------ From the Unicode Standard 2.0 (July 1996), Section 4.1: _Case_ is a normative property of characters in certain alphabets whereby characters are considered to be variants of a single letter. These variants, which may differ markedly in shape and size, are called the _uppercase_ letter (also known as _capital_ or _majuscule_) and the _lowercase_ letter (also known as _small_ or _minuscule_). The uppercase letter is generally larger than the lowercase letter. Because of the inclusion of certain composite characters for compatibility, such as U+01F1 LATIN CAPITAL LETTER DZ, there is a third case, called _titlecase_, which is used where the first character of a word is to be capitalized. An example of such a character is U+01F2 LATIN CAPITAL LETTER D WITH SMALL LETTER Z. The three case forms are UPPERCASE, Titlecase, lowercase. For those scripts that have case (Latin, Greek, Cyrillic, and Armenian), the case of a Unicode character can be obtained from the character's name. This is true for only these four scripts. Uppercase characters contain the word _capital_ in their names. Lowercase characters contain the word _small._ The word _small_ in the names of characters from scripts other than the four just listed has nothing to do with case. ... The lowercase letter default case mapping is between the small character and the capital character. The Unicode Standard case mapping tables, which are informative, are on the CD-ROM [bound in with the standard]. In a few instances, upper and lowercase mapping may differ from language to language between writing systems that employ the same letters.... [Examples cited are Turkish dotless "i" and French e-acute.] However, in general the vast majority of case mappings are uniform across languages.... Because there are many more lowercase forms than there are uppercase or titlecase, it is recommended that the lowercase form be used for normalization, such as when strings are case-folded for loose comparison or indexing. [Note:] Case correspondences are not always one-to-one: the result of case folding may be a different character length than in the source string (for example, U+00DF LATIN SMALL LETTER SHARP S [glyph for sz shown] becomes "SS" in uppercase).
Received on Sunday, 3 November 1996 23:16:10 UTC