- From: <lee@sq.com>
- Date: Fri, 25 Oct 96 00:58:17 EDT
- To: w3c-sgml-wg@w3.org
Some notes on case sensitivity, arguing that XML should use one of (1) US ASCII for all SGML names and ID values, or (2) NAMECASE NO -- i.e., all names to be case sensitive.

Note that URLs are case sensitive today, including (I think) the #Percy notation used in HTML to find an "id" such as <A name="Percy">. HTML users are thus used to case sensitive attributes. (Some elements and other attributes used to be case sensitive because of Mosaic/Netscape bugs, by the way, but I think they are all fixed now.)

Now... Case insensitivity is not well defined for Unicode/ISO 10646 as a whole and really only makes sense when you have a specific language -- but cross references from a French section to a Swedish section of a document might then be subject to different rules for case folding (whether accents are retained in upper case, for example).

The POSIX model is being extended, but POSIX does not deal with multilingual documents or environments, and explicitly says so. For example, in a POSIX regular expression, [:lower:] matches lower case letters as determined by the current locale.

Thus, if you are a Frenchman in Paris (say) reading a French document from Quebec, you can only set LOCALE to one value, despite the different capitalisation rules in Quebec (accents are retained, E is not the same as E-acute) and France (accents are dropped, e and e acute map to E). So an id of "m&ehat;l&eacute;e" maps to MELEE in France, but to ME^LE'E (I'll put the accents like this rather than using 8-bit characters) in Quebec, so that in France an ID of "MELEE" would be matched by the lower case IDREF I have given, but not in Quebec.

So ISO 10646 values are best if they are case sensitive, so the parser does not have to understand the complex rules for mapping equivalence classes.
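The ID/IDREF mismatch described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the two folding rules are simplified stand-ins for the France and Quebec conventions as described in this message, not real locale data or anything an SGML parser actually does.

```python
import unicodedata

def upper_keep_accents(name: str) -> str:
    """Hypothetical 'Quebec' rule: upper-case, accents retained."""
    return name.upper()

def upper_drop_accents(name: str) -> str:
    """Hypothetical 'France' rule: upper-case, accents dropped."""
    # Decompose each accented letter into base letter + combining mark,
    # then discard the combining marks before upper-casing.
    decomposed = unicodedata.normalize("NFD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.upper()

def matches(idref: str, id_value: str, fold) -> bool:
    """Compare an IDREF with an ID after applying one folding rule to both."""
    return fold(idref) == fold(id_value)

ID_VALUE = "MELEE"
IDREF = "m\u00eal\u00e9e"  # "melee" with e-circumflex and e-acute

if __name__ == "__main__":
    print(matches(IDREF, ID_VALUE, upper_drop_accents))  # True  ("France")
    print(matches(IDREF, ID_VALUE, upper_keep_accents))  # False ("Quebec")
    print(IDREF == ID_VALUE)  # False under either rule: no folding at all
```

The same pair of names resolves or fails to resolve depending on which rule the reader's system applies, whereas the plain case sensitive comparison on the last line gives the same answer everywhere.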
There is a discussion of this right now on the www-internationalisation list, and also in www-style inasmuch as the HTML CLASS attribute may have to be case sensitive.

In SGML, we have 13.4.5 Naming Rules, where the SGML process of case conversion is described. It doesn't work in the cases I have described, which are not uncommon, especially with distributed authoring over a network... In particular, you can specify an SGML declaration when you create the document such that capitalisation works, but only as long as the entire document is authored in the same language. Worse, you are not allowed to include lower case letters or upper case letters in LCNMCHAR and friends, so you can't map e acute to E (or am I mistaken??). (LC Letter and UC Letter are defined in Figure 1 on p. 29 as a-z and A-Z; upper and lower case letters are defined in section 4; the restriction is in section 13.4.5 just after the NAMECASE definition.)

If name conversion can't be made to work correctly, don't do it at all. There are no surprises that way. SGML users may be confused, perhaps -- but no more so than with <e/>, which, although it has not been accepted (and I still prefer <@e>!!), is more incompatible with an unmodified SGML declaration.

Lee
Received on Friday, 25 October 1996 10:10:33 UTC