Re: Case sensitivity in markup from Martin Bryan on 1996-11-04 (w3c-sgml-wg@w3.org from November 1996)

From: Martin Bryan <mtbryan@sgml.u-net.com>
Date: Mon, 4 Nov 1996 09:10:35 +0000
To: bosak@atlantic-83.Eng.Sun.COM (Jon Bosak), w3c-sgml-wg@w3.org
Message-Id: <96Nov4.091036+0000_gmt.40574-18398+970@mail.u-net.net>
At 20:14 3/11/96 -0800, Jon Bosak wrote:
>OK, internationalization experts, here's your chance to help out in a
>big way.
>
>The ERB has considered the issue of case sensitivity in markup at some
>length and has boiled down the options to just two alternatives:
>
>1. Full case sensitivity in markup.
>
>2. Case-folding using the Unicode rules.
>
>We would like everyone on the WG who has a qualified opinion on this
>issue to state which of these alternatives is preferable and why.
>
>The following points should be borne in mind:
>
>1. Option 1 has tremendous backward-compatibility implications.
>Important as compatibility is, however, the primary purpose of this
>activity is to enable standardized structured data to be generated and
>served out to Web clients, not to preserve existing legacy data
>unchanged.
>
>2. The Unicode method folds uppercase to lowercase (see below).  ISO
>8879 folds lowercase to uppercase; that is, it defines uppercase
>substitution (Clause 13.4.5) but not lowercase substitution.

My initial reaction to the Unicode tables you so thoughtfully supplied is
that there is nothing to stop them being used for lowercase to uppercase
substitution as for every lowercase character the uppercase equivalent is
identified where relevant. I note that for characters such as &szlig;, which
is not a one-to-one mapping, no entry is specified in the table, so
presumably the lowercase form would always be retained. The Unicode tables
also provide a useful form of normalization for the conversion of
letter+diacritic pairs into "normalized" characters.
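
As a rough illustration of these three points (one-to-one case pairs, the
sharp s exception, and letter+diacritic normalization), here is a minimal
sketch in Python. It relies on Python's built-in Unicode support rather than
on the tables quoted above, so treat the choice of functions as an
assumption; note that str.upper() applies the full case mapping, which
expands the sharp s instead of leaving it alone.

```python
import unicodedata

# Simple one-to-one mappings: each of these letters has a single partner.
assert "\u00e9".upper() == "\u00c9"      # e-acute -> E-acute
assert "\u00c9".lower() == "\u00e9"      # E-acute -> e-acute

# U+00DF (sharp s) has no single-character uppercase partner in the simple
# mapping; Python's str.upper() uses the full mapping and expands it.
sharp_s = "\u00df"
upper_sharp_s = sharp_s.upper()          # two letters, not one
round_trip = upper_sharp_s.lower()       # does not return to U+00DF

# Letter + combining diacritic composed into one precomposed character (NFC).
decomposed = "e\u0301"                   # "e" + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)
```

The round trip through uppercase loses the original character, which is why
a table that leaves the sharp s unmapped effectively keeps it in lowercase.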

Your question really boils down to "what does compatibility mean".

If we want compatibility with HTML then we need to allow tags to be entered
in both shifts, and be able to match a cap start-tag with a lowercase
end-tag for any element.
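
The difference between the two behaviours can be sketched as follows. The
function name tags_match and its case_sensitive flag are hypothetical,
chosen only for this illustration, and casefold() stands in for whatever
folding rule is finally adopted.

```python
def tags_match(start_name: str, end_name: str, case_sensitive: bool) -> bool:
    """Return True if an end-tag name closes the given start-tag name."""
    if case_sensitive:
        # Option 1: names must be identical, character for character.
        return start_name == end_name
    # HTML-style matching: fold both names before comparing them.
    return start_name.casefold() == end_name.casefold()


# A cap start-tag closed by a lowercase end-tag.
html_style = tags_match("TITLE", "title", case_sensitive=False)
strict_style = tags_match("TITLE", "title", case_sensitive=True)
```

Under HTML-style matching the pair is accepted; under full case sensitivity
it is rejected.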

For SGML compatibility we could presume that the file is normalized prior to
being passed to the web as an XML file. Such normalization would turn any
lowercase name characters to caps if NAMECASE GENERAL YES was in force. We
know we have to change the SGML DTD to make it compatible with XML. One
such change, therefore, would be that all element, attribute, etc, names
(except entity names) would have to be capitalized. Using this simple rule
the default output for SGML files would be the capitalized names, and we
could use option 1 within XML.
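
A minimal sketch of such a normalization pass is shown below. It is a
deliberate simplification: the helper upcase_element_names is hypothetical,
it only handles ASCII element names in start- and end-tags, and it ignores
attribute names, comments, marked sections and declarations, which a real
SGML normalizer would have to deal with.

```python
import re

# "<" or "</" followed by a name built from ASCII name characters.
TAG_NAME = re.compile(r"(</?)([A-Za-z][A-Za-z0-9.\-]*)")


def upcase_element_names(markup: str) -> str:
    """Capitalize element names in tags; entity references are left alone."""
    return TAG_NAME.sub(lambda m: m.group(1) + m.group(2).upper(), markup)


normalized = upcase_element_names("<Title>Case in &amp; out</title>")
```

After this pass the start- and end-tag names agree exactly, so a
case-sensitive comparison is enough, while the entity name keeps its
original case.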

If we adopt a case sensitive approach for XML, HTML files that have been
normalized using the rules applied for existing SGML files could also be
treated as if they had caps-only markup.

New XML files should not be a problem if we apply the case sensitive
approach, especially if we recommended use of lowercase characters for
names. This would have the advantage of clearly distinguishing normalized
SGML and HTML files from specially created XML files, and would allow us to
maximize the number of characters available for markup.

The problem with option 2 is that it is unclear what the rules for
converting from uppercase to lowercase are. In a French word, for example,
you cannot presume that a cap E will convert to a lowercase e on conversion;
it may well need to be converted to an accented character, depending on the
word it has been used in.
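
The French case can be sketched like this, assuming the common convention
of writing capitals without their accents. The helper strip_accents is
hypothetical and only serves to produce the unaccented capitals; the point
is that lowercasing them does not recover the original word.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


word = "\u00e9l\u00e8ve"                       # "eleve" with acute and grave accents
caps_unaccented = strip_accents(word).upper()  # capitals written without accents
folded = caps_unaccented.lower()               # a purely mechanical lowercase fold
```

The folded form has plain "e" in both positions, so a table-driven fold
cannot tell which accented letter, if any, the capital stood for.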

A third option would be to use the Unicode tables to convert lowercase
characters to their cap equivalents, reducing the number of markup
characters to those for which an equivalent has been defined. This would
probably lead to fewer anomalies than converting caps to lowercase, given
that most accented characters have cap equivalents within 10646, but would
lead to anomalies due to locale-specific changes in the way capitalization
is done. Whilst this approach would have the advantage of making XML
backward compatible with the existing version of SGML, I'm not convinced that
it would remain compatible with SGML97++.

The one question I have is "Why the rush?" Given that WG8 may discuss this
issue next week, while i18n will discuss it the week after, could we not
hold this discussion off for a fortnight to allow a more detailed discussion
of it? I would particularly like to raise this question at the i18n
conference in Seville, probably on November 21st when the problems of the
internationalization of HTML are due to be discussed. It seems silly to
force XML members to answer this question before the acknowledged experts in
the field meet to discuss it.

Martin Bryan
----
Martin Bryan, The SGML Centre, Churchdown, Glos. GL3 2PU, UK 
Phone/Fax: +44 1452 714029   WWW home page: http://www.u-net.com/~sgml/
Received on Monday, 4 November 1996 04:13:29 UTC