- From: Martin Bryan <mtbryan@sgml.u-net.com>
- Date: Mon, 4 Nov 1996 09:10:35 +0000
- To: bosak@atlantic-83.Eng.Sun.COM (Jon Bosak), w3c-sgml-wg@w3.org
At 20:14 3/11/96 -0800, Jon Bosak wrote:
>OK, internationalization experts, here's your chance to help out in a
>big way.
>
>The ERB has considered the issue of case sensitivity in markup at some
>length and has boiled down the options to just two alternatives:
>
>1. Full case sensitivity in markup.
>
>2. Case-folding using the Unicode rules.
>
>We would like everyone on the WG who has a qualified opinion on this
>issue to state which of these alternatives is preferable and why.
>
>The following points should be borne in mind:
>
>1. Option 1 has tremendous backward-compatibility implications.
>Important as compatibility is, however, the primary purpose of this
>activity is to enable standardized structured data to be generated and
>served out to Web clients, not to preserve existing legacy data
>unchanged.
>
>2. The Unicode method folds uppercase to lowercase (see below). ISO
>8879 folds lowercase to uppercase; that is, it defines uppercase
>substitution (Clause 13.4.5) but not lowercase substitution.

My initial reaction to the Unicode tables you so thoughtfully supplied is that there is nothing to stop them being used for lowercase to uppercase substitution, as for every lowercase character the uppercase equivalent is identified where relevant. I note that for characters such as ß, which is not a one-to-one mapping, no entry is specified in the table, so presumably the lowercase form would always be retained.

The Unicode tables also provide a useful form of normalization for the conversion of letter+diacritic pairs into "normalized" characters.

Your question really boils down to "what does compatibility mean?" If we want compatibility with HTML then we need to allow tags to be entered in both shifts, and be able to match a cap start-tag with a lowercase end-tag for any element. For SGML compatibility we could presume that the file is normalized prior to being passed to the web as an XML file.
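The points above (matching a cap start-tag with a lowercase end-tag, the ß case, and letter+diacritic normalization) can be illustrated with a minimal sketch. It assumes Python's built-in Unicode data, which is much newer than the tables under discussion here, and it uses `str.lower()` only as an approximation of option 2. The helper `names_match` is a hypothetical name introduced for illustration, not anything defined by SGML, XML, or Unicode.

```python
import unicodedata


def names_match(start_name: str, end_name: str, fold: bool) -> bool:
    """Compare a start-tag name with an end-tag name.

    fold=False models option 1 (full case sensitivity).
    fold=True approximates option 2 by lowercasing both names with
    str.lower(); this is an assumption, not the exact 1996 Unicode rules.
    """
    if fold:
        return start_name.lower() == end_name.lower()
    return start_name == end_name


# A cap start-tag against a lowercase end-tag, as in HTML practice.
print(names_match("TITLE", "title", fold=False))  # False
print(names_match("TITLE", "title", fold=True))   # True

# U+00DF (sharp s) has no one-to-one uppercase partner in Python's data:
# lowercasing leaves it alone, uppercasing expands it to two letters.
sharp_s = "\u00df"
print(sharp_s.lower() == sharp_s)  # True
print(sharp_s.upper())             # SS

# Letter + combining diacritic normalized to a single precomposed character
# (e followed by U+0301 COMBINING ACUTE ACCENT becomes U+00E9).
print(unicodedata.normalize("NFC", "e\u0301") == "\u00e9")  # True
```

Under these assumptions, folding makes the HTML-style tag pair match, while the sharp s shows why a strictly one-to-one folding table has to leave some characters unchanged.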
Such normalization would turn any lowercase name characters to caps if NAMECASE GENERAL YES was in force. We know we have to change the SGML DTD to make it compatible with XML. One such change, therefore, would be that all element, attribute, etc., names (except entity names) would have to be capitalized. Using this simple rule the default output for SGML files would be the capitalized names, and we could use option 1 within XML.

If we adopt a case-sensitive approach for XML, HTML files that have been normalized using the rules applied for existing SGML files could also be treated as if they had caps-only markup.

New XML files should not be a problem if we apply the case-sensitive approach, especially if we recommend use of lowercase characters for names. This would have the advantage of clearly distinguishing normalized SGML and HTML files from specially created XML files, and would allow us to maximize the number of characters available for markup.

The problem with option 2 is that it is unclear what the rules for converting from uppercase to lowercase are. In a French word, for example, you cannot presume that a cap E will convert to a lowercase e on conversion; it may well need to be converted to an accented character, depending on the word it has been used in.

A third option would be to use the Unicode tables to convert lowercase characters to their cap equivalents, reducing the number of markup characters to those for which an equivalent has been defined. This would probably lead to fewer anomalies than converting caps to lowercase, given that most accented characters have cap equivalents within 10646, but would lead to anomalies due to locale-specific changes in the way capitalization is done. Whilst this approach would have the advantage of making XML backward compatible with the existing version of SGML, I'm not convinced that it would remain compatible for SGML97++.

The one question I have is "Why the rush?"
Given that WG8 may discuss this issue next week, while i18n will discuss it the week after, could we not hold this discussion off for a fortnight to allow a more detailed discussion of it? I would particularly like to raise this question at the i18n conference in Seville, probably on November 21st when the problems of the internationalization of HTML are due to be discussed. It seems silly to force XML members to answer this question before the acknowledged experts in the field meet to discuss it.

Martin Bryan
----
Martin Bryan, The SGML Centre, Churchdown, Glos. GL3 2PU, UK
Phone/Fax: +44 1452 714029
WWW home page: http://www.u-net.com/~sgml/
Received on Monday, 4 November 1996 04:13:29 UTC