Re: [CSS21][css3-namespace][css3-page][css3-selectors][css3-content] Unicode Normalization from Robert J Burns on 2009-02-05 (public-i18n-core@w3.org from January to March 2009)

From: Robert J Burns <rob@robburns.com>
Date: Thu, 5 Feb 2009 13:30:14 -0600
To: public-i18n-core@w3.org, W3C Style List <www-style@w3.org>
Message-Id: <FD91BCF8-40ED-4FBB-AFD2-77A99C8D681A@robburns.com>
Hi Henri,

> On Feb 5, 2009, at 01:32, Jonathan Kew wrote:
>
> > On 4 Feb 2009, at 12:36, Henri Sivonen wrote:
> >
> >> Before anyone accuses me of “Western bias”, I'd like to point
> >> out that there is precedent for preferring (in my opinion quite
> >> reasonably) a faster kind of Unicode string equality relation over
> >> one that could be claimed to make more sense to users of Western
> >> languages: XML deliberately uses code point for code point string
> >> comparison for start/end tag names instead of Unicode-wise correct
> >> case-insensitive comparison for performance reasons (and to avoid
> >> having to pick a collation locale). (I'm assuming here that we can
> >> agree that bicameral scripts are big in the “West” and case-
> >> insensitive equality makes some intuitive sense for the users of
> >> bicameral scripts.)
> >
> > The choice of case-sensitivity vs case-insensitivity is not really
> > comparable.
>
> My point is that it's generally not helpful to bring out the Western
> bias[1] thing in discussions of using Unicode in computer languages.
> Previously, too, performance has been preferred over full natural
> language complexity for computer language identifier equality
> comparison and in that instance clearly it could not have been an
> issue of Western bias. The thing is that comparing computer language
> identifiers code point for code point is the common-sense thing to do.
> If you consider the lack of case-insensitivity, some languages are not
> perfectly convenienced. If you consider the lack of normalization,
> another (overlapping) set of languages is not perfectly convenienced.
> If you consider the sensitivity to diacritics, yet another set of
> languages is not perfectly convenienced. No language is prohibited by
> code point for code point comparison, though.

However, what makes case sensitivity a really bad example of an absence
of Western bias is, as I have already said, that we have already dealt
unambiguously with case sensitivity in HTML, in XML, in CSS, etc. Those
specifications all make it clear how authors and implementations should
deal with case. That has not been done for canonically equivalent string
matching. You're saying here that authors should simply handle it by
producing NFC content, case closed. Well, even if that were the way to
deal with it (and I'm not saying it is), the case can't be closed until
every recommendation from the W3C includes normative language to that
effect.
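
To make the distinction concrete, here is a minimal Python sketch of code point for code point comparison versus canonical-equivalence matching. It is an illustration only, not something any of the specifications above mandate. The helper name canonical_equal and the sample strings are assumptions made for the example; unicodedata.normalize is the standard library call.

```python
import unicodedata


def canonical_equal(a: str, b: str) -> bool:
    """Return True if a and b are canonically equivalent (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "caf\u00e9"   # e-acute as the single code point U+00E9
decomposed = "cafe\u0301"   # e followed by U+0301 COMBINING ACUTE ACCENT

print(precomposed == decomposed)                 # False: the code points differ
print(canonical_equal(precomposed, decomposed))  # True: same text after NFC
```

Both strings normally render identically, which is why an author cannot be expected to spot the mismatch by looking at the source.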

>
> > It's true that most current programming and markup languages are
> > case sensitive, I think, although this is not universal. (What does
> > the HTML spec say? The UAs I'm accustomed to seem to treat it as
> > case-insensitive.)
>
> HTML5 says the parser replaces A through Z with a through z and
> thereafter the comparisons are done code point for code point. This is
> done for backward compatibility. Doing it this way works, because the
> conforming HTML vocabulary is entirely in the Basic Latin range.
>
> Also, doing it this way avoids the problem of sneaking scripts past
> ASCII-oriented black list-based gatekeepers by writing <SCRİPT>.

So HTML can take a performance hit like that for case sensitivity, but  
for canonical normalization it would be an undue burden. How is that  
not Western bias?
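
For reference, the ASCII-only folding described in the quoted text can be sketched roughly as follows. This is a simplified illustration rather than the HTML5 tokenizer itself, and the function name ascii_lower is an assumption made for the example.

```python
def ascii_lower(name: str) -> str:
    """Lowercase only A-Z, leaving every other code point untouched."""
    return "".join(
        chr(ord(ch) + 0x20) if "A" <= ch <= "Z" else ch
        for ch in name
    )


tag = "SCR\u0130PT"  # contains U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE

print(ascii_lower("SCRIPT") == "script")   # True
print(ascii_lower(tag) == "script")        # False: U+0130 is left as-is
print([hex(ord(c)) for c in tag.lower()])  # full Unicode lowercasing, for contrast
```

The fold touches 26 code points and needs no data tables, which is the sense in which it is cheap compared with canonical normalization.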

> > Similarly with names in filesystems: both case-sensitive and case-
> > insensitive systems are in widespread use. There is presumably a
> > performance win for those that are case sensitive, but it doesn't
> > appear to be a compelling criterion for system developers.
>
> File names are exposed to all end users. However, class names and
> selectors are only exposed to a step more technically savvy group of
> people who deal with code.

Class names and identifiers have become an important part of authoring
web content. Authors at all levels are likely to deal with class names
and identifiers. In my opinion, very few authors should ever even need
to know about the concept of canonical equivalence. It's a Unicode
implementation detail that authors simply shouldn't be burdened with.

> (Aside: Letting the non-NFC not-quite-NFD form of HFS+ file names to
> leak to e.g. URIs is pretty annoying.)
>
> > However, the key difference (in my mind, at least) is that for
> > bicameral Western scripts, the user can clearly see the difference
> > between upper and lower case.
>
> Sure, but that's not part of the point I was making.

No, it wasn't part of the point you were making, but it is relevant to
your analogy. The visible difference in case is an important part of
why the W3C standards have been able to deal with case sufficiently, or
even completely. The hidden nature of canonical equivalence means that
we can't rely on authors recognizing normalization problems simply by
examining the text.

> > For a better analogy, imagine having to work with a language like
> > HTML or XML or CSS, but where the distinction between lines
> > terminated by CR, LF, or CR/LF is significant to the interpretation
> > of your markup. Now imagine sharing files and fragments with co-
> > workers using different tools on different platforms, where the
> > usual line-end conventions vary. Some of your editors may show the
> > different kinds of line differently, perhaps, but many will display
> > them all identically. That's not a scenario I would wish on anyone.
> > I don't want "human-readable" source code that has to be hex-dumped
> > before I can be sure of its meaning.
>
> CRLF to LF normalization is implementation-wise so troublesome and
> annoying in performance-critical code that if it is an analogy for
> anything, the conclusion should be firmly against introducing any more
> things that are analogous with CRLF.

No one is suggesting we introduce another thing like CRLF
normalization. We're past that point. The canonical normalization  
thing has already been introduced. Burying our heads in the sand will  
not make it go away. I don't think I've heard anyone here saying that  
canonical normalization is no big deal to handle. Frankly I think it  
is a mess.[1] This is a place where Unicode really dropped the ball  
and has left things in the hands of others to straighten out.

Ideally Unicode, or some W3C/Unicode liaison, would deal with this. We
need clearer norms for font vendors/authors, clearer norms for input
systems, and clearer norms for canonical string comparison. However, I
find it hard to imagine how parser-stage normalization could not be an
essential part of any strategy to fix this situation, just as CRLF
normalization was the way disparate line-ending conventions got dealt
with.
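
As a rough sketch of what parser-stage normalization could look like, the Python below normalizes identifiers once when they are read, so that every later comparison or lookup is a plain code point match. The names normalize_identifier and rules, and the sample class name, are hypothetical; choosing NFC here is an assumption, not something any specification currently requires.

```python
import unicodedata


def normalize_identifier(raw: str) -> str:
    """Normalize an identifier to NFC once, at parse time (hypothetical helper)."""
    return unicodedata.normalize("NFC", raw)


# Stylesheet side: the selector's class name was typed in precomposed form.
rules = {normalize_identifier("r\u00e9sum\u00e9"): "color: red"}

# Markup side: the same class name arrives in decomposed form.
class_attr = normalize_identifier("re\u0301sume\u0301")

# After parse-time normalization an ordinary dictionary lookup succeeds.
print(rules.get(class_attr))
```

The cost is paid once per identifier at parse time, and the matching code downstream does not need to know that canonical equivalence exists.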

> I'm sure normalizing CRLF to LF seems like no big deal conceptually,  
> but implementing e.g. CRLF to LF normalization where the CR comes  
> from document.write() and the LF from the network stream gets more  
> complex than it first appears if you also want to maintain sane  
> buffering that doesn't copy data too many times.

These issues are no doubt complex. However, the problem is that if  
they are not dealt with at the parser level (and in other tools), they  
create immensely more complicated problems that are even more  
difficult to deal with for every author throughout the entire world.
No one here is expecting you personally to solve these problems in  
every parser ever written. However, it is something that needs to be  
dealt with at that stage for the sanity of every author on the planet.
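
To illustrate the kind of state a parser has to carry, here is a minimal Python sketch of newline normalization where a CR and its LF may arrive in different chunks (for instance, one from a script and one from the network). The class name NewlineNormalizer is an assumption made for the example, and the sketch copies strings freely, so it does not address the buffering cost raised in the quoted text.

```python
class NewlineNormalizer:
    """Convert CRLF and lone CR to LF across arbitrary chunk boundaries."""

    def __init__(self) -> None:
        self._pending_cr = False  # True if the previous chunk ended with CR

    def feed(self, chunk: str) -> str:
        if not chunk:
            return ""
        # The previous chunk's trailing CR was already emitted as LF, so an
        # LF at the start of this chunk completes a CRLF pair: drop it.
        if self._pending_cr and chunk.startswith("\n"):
            chunk = chunk[1:]
        self._pending_cr = chunk.endswith("\r")
        return chunk.replace("\r\n", "\n").replace("\r", "\n")


normalizer = NewlineNormalizer()
output = normalizer.feed("a\r") + normalizer.feed("\nb")
print(repr(output))  # 'a\nb': the split CRLF became a single LF
```

A streaming canonical normalizer would need analogous state, since a base character and its combining marks can also be split across chunks.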

Take care,
Rob

[1]: For example, there are characters that seem semantically distinct
to authors but are lossily replaced under both NFC and NFD. Such
characters probably need to be deprecated, because they have long been
canonical equivalents of other characters and yet are included in input
systems and elsewhere in ways indistinguishable from other usable
characters. These are: Angstrom (Å, U+212B: use U+00C5 instead), Ohm
(Ω, U+2126: use U+03A9 instead), Kelvin (K, U+212A: use U+004B instead)
and Prosgegrammeni (ι, U+1FBE: use U+03B9 instead), many of which even
use different glyphs in my fonts.
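
The replacements listed in this footnote can be checked with a short Python sketch using the standard unicodedata module. The dictionary name nfc_targets is an assumption made for the example; the exact results depend on the Unicode data shipped with the Python build, although these particular mappings are long-standing.

```python
import unicodedata

# Each key is a character from the footnote; each value is its NFC result.
nfc_targets = {
    "\u212b": "\u00c5",  # ANGSTROM SIGN -> LATIN CAPITAL LETTER A WITH RING ABOVE
    "\u2126": "\u03a9",  # OHM SIGN -> GREEK CAPITAL LETTER OMEGA
    "\u212a": "\u004b",  # KELVIN SIGN -> LATIN CAPITAL LETTER K
    "\u1fbe": "\u03b9",  # GREEK PROSGEGRAMMENI -> GREEK SMALL LETTER IOTA
}

for original, expected in nfc_targets.items():
    nfc = unicodedata.normalize("NFC", original)
    nfd = unicodedata.normalize("NFD", original)
    # The original code point survives in neither normalization form.
    print(
        f"U+{ord(original):04X}",
        "NFC:", [f"U+{ord(c):04X}" for c in nfc],
        "NFD:", [f"U+{ord(c):04X}" for c in nfd],
        "matches expected NFC:", nfc == expected,
    )
```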
Received on Thursday, 5 February 2009 19:30:58 UTC