Re: IDs - make them case sensitive from Rick Jelliffe on 1997-06-27 (w3c-sgml-wg@w3.org from June 1997)

From: Rick Jelliffe <ricko@allette.com.au>
Date: Sat, 28 Jun 1997 04:00:16 +1000
To: <w3c-sgml-wg@w3.org>
Message-Id: <199706271823.EAA11037@jawa.chilli.net.au>
> From: Bert Bos <bert@w3.org>

> A proposal:
> 
> * Make the values of ID attributes case-sensitive

It is a fair proposal, and a well-trodden path. The basic reason
against it is that it introduces a whole new class of frustrating 
errors for users, especially users coming from the PC world, who
assume that identifiers are case-insensitive as a matter of course.
 
> The HTML WG recently studied the issue. In summary:
> 
>   - current browsers don't consider <A NAME="xxx"> to be a
>     target for <A HREF="#XXX">

In SGML, entity references are case-sensitive in most DTDs
(to distinguish &Aumlaut; from &aumlaut;), but other names
(e.g. the GIs of elements) aren't.  Of course, SGML lets you
pick the case behaviour, and after 10 years people still
predominantly choose case-insensitive names.

>   - there is no way to determine the language of an ID,
>     therefore the case-mapping rules aren't known either.
>     Any mapping rule will surprise some people.

Software that is predictable will surprise some people :-)

The Unicode 2.0 book (p.4-2) notes "in general, the vast majority
of case mappings are uniform across languages".

We discussed the issue (at WG8 in 1995) of whether SGML 
should allow more sophisticated kinds of case mapping 
(e.g. many to single), and we found that no national body 
wanted to request it. The general consensus was that 
more complex rules, just to handle a few aberrant 
case mappings, were not worth complicating the SGML declaration 
any further, especially if each problem occurred for 
only a single nation's script and each needed
a different syntax.

>   - case-sensitivity is easy to explain and avoids
>     surprises (e.g., people find it easy to see a
>     difference between A and a, much easier than between
>     full-width/half-width letters in Japanese, or
>     precomposed letters and floating accents, e.g....)

In the particular case of the full-width and half-width 
alphabets and katakana, only one set is allowed in the 
SGML declaration I have sent in. (All compatibility zone
characters are dropped.)
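
As a rough sketch of the effect, and not the declaration itself: the
check below treats U+F900 to U+FFEF as the compatibility zone. That
range is my assumption for the example, and the class and method names
are invented; the declaration remains the authority on which characters
are actually allowed.

```java
// Hypothetical filter: names and range are assumptions for illustration.
final class NameChars {
    // Assumed bounds of the compatibility zone (not taken from the declaration).
    static final char ZONE_START = '\uF900';
    static final char ZONE_END = '\uFFEF';

    // True if the character falls inside the assumed compatibility zone,
    // which includes the full-width and half-width forms.
    static boolean inCompatibilityZone(char c) {
        return c >= ZONE_START && c <= ZONE_END;
    }

    // True if no character of the name comes from the assumed zone.
    static boolean avoidsCompatibilityZone(String name) {
        for (int i = 0; i < name.length(); i++) {
            if (inCompatibilityZone(name.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Full-width Latin A (U+FF21) versus ordinary A.
        System.out.println(inCompatibilityZone('\uFF21'));
        System.out.println(inCompatibilityZone('A'));
    }
}
```

With only one set admitted, the full-width/half-width confusion cannot
arise inside a name in the first place.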
 
>   - case-insensitive mapping is hard to implement;
>     it needs a few dozen Kb of tables in Java.

We are using the standard Unicode case-mapping, I believe.
JDK 1.1 comes with Unicode case functions in java.lang.Character.
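
A minimal sketch of what that could look like, assuming IDs are
compared one character at a time with the simple one-to-one mappings
(the class and method names are made up for this example and come from
no draft):

```java
// Hypothetical helper: illustrative only, not part of any XML draft.
final class IdCompare {
    // Fold one char with the simple (one-to-one) Unicode mappings
    // that java.lang.Character provides; no tables of our own.
    static char fold(char c) {
        return Character.toLowerCase(Character.toUpperCase(c));
    }

    // True if the two IDs are equal after per-character case folding.
    static boolean sameId(String a, String b) {
        if (a.length() != b.length()) {
            return false;
        }
        for (int i = 0; i < a.length(); i++) {
            if (fold(a.charAt(i)) != fold(b.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The NAME="xxx" versus HREF="#XXX" case from the quoted summary.
        System.out.println(sameId("xxx", "XXX"));
        System.out.println(sameId("xxx", "xxy"));
    }
}
```

The tables live in the Java runtime, so the application itself carries
none.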

>   - the repertoire of Unicode/ISO-10646 is open-ended: more
>     letters will be added later, but with case-insensitive
>     mapping, the implementations won't have to change.

I have always pushed that XML naming rules stick with just the
characters that appear in common words and are in non-surrogate, 
non-compatibility zone Unicode.  Native Language Markup does not
demand that all words be available, just a minimum set.

> The well-known problem cases are the dotless-i of Turkish, the sharp-s
> of German, the uncertainty over dropping accents from uppercase letters
> in French.

But Germans know they have a problem with computers and the sharp-s,
and Turks know their eyes are dotty. I don't know what the French
think about their accents, but I'd imagine it is the same.

As far as the Turkish I goes, I'd much prefer to say that all four
are the same character.  But we should defer to the Unicode people:
it is their game.
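
To make the four concrete: they are I (U+0049), i (U+0069), dotted
capital I (U+0130) and dotless small i (U+0131). As far as I can tell,
folding each one up and then down with the default, non-locale
functions in java.lang.Character sends all four to plain i, though
that is worth checking against the Unicode tables in your own Java.
A small sketch (the class name is invented for the example):

```java
// Hypothetical demo class; the name is invented for this example.
final class TurkishI {
    // The four characters: I (U+0049), i (U+0069),
    // dotted capital I (U+0130), dotless small i (U+0131).
    static final char[] FOUR = { 'I', 'i', '\u0130', '\u0131' };

    // Fold up, then down, using the default (non-locale) simple mappings.
    static char fold(char c) {
        return Character.toLowerCase(Character.toUpperCase(c));
    }

    // True if every one of the four folds to the same character.
    static boolean allFourFoldTogether() {
        char first = fold(FOUR[0]);
        for (int i = 1; i < FOUR.length; i++) {
            if (fold(FOUR[i]) != first) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Print each code point and what it folds to, in hex.
        for (int i = 0; i < FOUR.length; i++) {
            System.out.println(Integer.toHexString(FOUR[i]) + " -> "
                    + Integer.toHexString(fold(FOUR[i])));
        }
        System.out.println(allFourFoldTogether());
    }
}
```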

In English, we don't really complain that we cannot have spaces 
in our names, even though it produces somewhat artificial strings;
the Germans may smile at us pityingly and pat themselves on their
backs for having an agglutinating language.  In English we care that
we can use alphabetics and that we can at least make sense of our 
identifiers.

I think people know that universality gives more immediate results,
even if there are still a few bumps for most nations.


Rick Jelliffe
Received on Friday, 27 June 1997 14:24:09 UTC