
A note on case sensitivity



Some notes on case sensitivity, arguing that XML should use one of
(1) US ASCII for all SGML names and ID values, or
(2) NAMECASE NO -- i.e., all names to be case sensitive.

Note that URLs are case sensitive today, including (I think) the #Percy
notation used in HTML to find an "id" such as <A name="Percy">.
HTML users are thus used to case sensitive attribute values.
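
As a rough illustration of the behaviour described above (not of any
particular browser), here is a minimal Python sketch.  The helper names
AnchorCollector and fragment_resolves and the example.org URL are made up
for this note.  It assumes fragments are compared to <A name="..."> values
exactly, byte for byte.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag


class AnchorCollector(HTMLParser):
    """Collect the values of name attributes on <a> elements, unchanged."""

    def __init__(self):
        super().__init__()
        self.names = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names, but it leaves
        # attribute values exactly as written in the document.
        if tag == "a":
            for key, value in attrs:
                if key == "name" and value is not None:
                    self.names.add(value)


def fragment_resolves(url, html):
    """True if the URL fragment exactly equals some anchor name in html."""
    collector = AnchorCollector()
    collector.feed(html)
    return urldefrag(url).fragment in collector.names


page = '<A name="Percy">Percy</A>'
print(fragment_resolves("http://example.org/doc.html#Percy", page))  # True
print(fragment_resolves("http://example.org/doc.html#percy", page))  # False
```

Note that the element name A is matched regardless of case here, while the
value "Percy" is not.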

(Some elements and other attributes used to be case sensitive because of
Mosaic/Netscape bugs, by the way, but I think they are all fixed now.)

Now...

Case insensitivity is not well defined for Unicode/ISO 10646 as a whole
and really only makes sense when you have a specific language -- but
then cross references from a French section to a Swedish section of a
document might be subject to different rules for case conversion
(whether accents are retained in upper case, for example).

The POSIX model is being extended, but POSIX does not deal with
multilingual documents or environments at all, and explicitly says so.
For example, in a POSIX regular expression, the [:lower:] character
class matches lower case letters as determined by the current locale,
and there is only one current locale at a time.

Thus, if you are a Frenchman in Paris (say) reading a French document
from Quebec, you can only set your locale to one value, despite the
different capitalisation rules in Quebec (accents are retained, E is not
the same as E-acute) and France (accents are dropped, e and e-acute both
map to E).  So an id of "m&ecirc;l&eacute;e" maps to MELEE in France, but
to ME^LE'E (I'll put the accents like this rather than using 8-bit
characters) in Quebec.  In France an ID of "MELEE" would therefore
be matched by the lower case IDREF I have given, but not in Quebec.
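
The mismatch can be sketched in Python.  This is only an illustration:
the two functions below are hypothetical stand-ins for the two
capitalisation conventions described above, not real locale
implementations, and the accent-dropping rule is approximated by
stripping Unicode combining marks.

```python
import unicodedata


def upper_keep_accents(name):
    """Upper-case and retain accents (the Quebec convention above)."""
    return name.upper()


def upper_drop_accents(name):
    """Upper-case and drop accents (the France convention above)."""
    # Decompose accented letters into base letter + combining mark,
    # then discard the combining marks before upper-casing.
    decomposed = unicodedata.normalize("NFD", name)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.upper()


idref = "m\u00eal\u00e9e"   # m, e-circumflex, l, e-acute, e
target_id = "MELEE"

print(upper_drop_accents(idref) == target_id)  # True: matches in "France"
print(upper_keep_accents(idref) == target_id)  # False: no match in "Quebec"
```

The same ID and IDREF pair resolves under one rule and fails under the
other, and nothing in the document says which rule applies.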

So ISO 10646 names are best treated as case sensitive, so that the
parser does not have to understand the complex rules for mapping
equivalence classes.  There is a discussion of this right now on the
www-internationalisation list, and also in www-style inasmuch as the
HTML CLASS attribute may have to be case sensitive.

In SGML, we have 13.4.5 Naming Rules, where the SGML process of
case conversion is described.

It doesn't work in the cases I have described, which are not uncommon,
especially with distributed authoring over a network...

In particular, you can specify an SGML declaration when you create the
document, such that capitalisation works, as long as the entire document
is authored in the same language.  Worse, you are not allowed to include
lower case letters or upper case letters in LCNMCHAR and friends, so
you can't map e-acute to E (or am I mistaken??).  (LC Letter and UC Letter
are defined in Figure 1 on p. 29 as a-z and A-Z; upper and lower case
letters are defined in section 4; the restriction is in section 13.4.5,
just after the NAMECASE definition.)

If name conversion can't be made to work correctly, don't do it at all.
There are no surprises that way.
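
A minimal sketch of what "don't do it at all" means for ID/IDREF
resolution, in Python.  The function name unresolved_idrefs is made up
for this note, and it assumes names are compared code point for code
point with no case conversion and no locale.

```python
def unresolved_idrefs(idrefs, ids):
    """Return the IDREFs with no exactly matching ID, in document order."""
    declared = set(ids)
    # Plain string equality: no case folding, no locale lookup.
    return [ref for ref in idrefs if ref not in declared]


ids = ["m\u00eal\u00e9e", "MELEE"]            # two distinct IDs
idrefs = ["m\u00eal\u00e9e", "MELEE", "Melee"]

print(unresolved_idrefs(idrefs, ids))  # ['Melee']
```

The result is the same wherever the document is processed, which is the
point: the accented and unaccented names are simply different names.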

SGML users may be confused, perhaps -- but no more so than with <e/>,
which, although it has not been accepted (and I still prefer <@e>!!),
is more incompatible with an unmodified SGML declaration.

Lee