RE: [draft] Unicode Normalization: requests for CSS-WG, HTML-CG agendum

Hello Martin,

Thanks for your comments. Some *personal* observations follow.

Addison

Addison Phillips
Globalization Architect -- Lab126

Internationalization is not a feature.
It is an architecture.


> -----Original Message-----
> From: Martin Duerst [mailto:duerst@it.aoyama.ac.jp]
> Sent: Sunday, February 08, 2009 12:11 AM
> To: Phillips, Addison; public-i18n-core@w3.org
> Subject: Re: [draft] Unicode Normalization: requests for CSS-WG,
> HTML-CG agendum
> 
> I haven't had time to read all the messages in the recent threads,
> but I have to disagree with your conclusions.
> 
> First, I don't think that after having worked on this issue for
> more than 10 years (for the I18N WG, it started at the WWW
> Conference
> in Brisbane in 1998), with much meandering of the WG itself
> (from a strong focus on early normalization initially to the
> stress on very complex definitions such as "fully normalized"
> for top security to the recent apparent trend for "late matching"),
> without many concrete results (convincing other WGs and
> implementers),
> but also without much concrete damage, we can suddenly claim that
> this is now urgently needed, and get everybody convinced.

I don't think that we claim particular urgency for the issue. However, to do nothing is to make a decision. No matter what happens, in my view, we are near the end of our very long meander in the world of normalization. With neither a requirement for early uniform normalization (EUN) nor requirements for "middle" or "late" normalization in affected specifications, the most we can hope for is support for normalization routines in various environments and then education of affected user populations.

I should note that this shouldn't be all that surprising a conversation. The I18N WG's position on normalization officially changed from trying to require EUN to trying to address "late" normalization in early 2005 (almost exactly four years ago).

The real crux of this problem is a fundamental disconnect between Unicode on the one hand and XML (insert 'CSS' or 'HTML' here if you prefer) on the other. Unicode defines canonical equivalence, and it seems clear to me that, in most cases, canonical equivalents are meant to be recognized as somehow the "same". XML 1.0 relies fundamentally on Unicode and so should have dealt with the normalization issue definitively as well. Ideally, canonically equivalent sequences would be treated the same and XML documents would be "fully normalized" (that is, not just that the file containing the XML is normalized, but that the resulting DOM tree and all of its parts are normalized as well, though possibly not the includes [that's "include normalized"]).
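To make that concrete, here is a tiny sketch in Python (using only the standard unicodedata module; purely illustrative, not from any spec): the precomposed and decomposed spellings of "é" are different code point sequences but canonically equivalent, and they only compare equal once both are normalized.

    import unicodedata

    composed = "\u00E9"     # LATIN SMALL LETTER E WITH ACUTE (precomposed)
    decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT

    print(composed == decomposed)   # False: the raw code point sequences differ
    print(unicodedata.normalize("NFC", composed) ==
          unicodedata.normalize("NFC", decomposed))   # True: canonically equivalent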

Occasional use of SHOULD for NFC doesn't resolve this issue. Normalization has to be done explicitly, or two canonically equivalent but different character sequences are, de facto, different. Since it is not actually required (which is what we tried to do with CharMod) and since making it a requirement for the (fill-in-the-blank) standard has consistently been a non-starter, the I18N WG concluded that EUN could not be done and that we could either surrender utterly or attempt to deal with each normalization problem piecemeal. We chose the latter course. You know about this: you were the Activity lead and I was the chair when this happened.

> 
> Second, I think that if we want to send a message like the one
> below, we have to be much more specific about actually affected
> languages and other details.

Which we've done separately elsewhere. It should be in my email.

> 
> Third, I don't believe that late matching as proposed below is
> the right solution. It's much easier to say that content should
> be in NFC. Where it's relevant, the right tools will emerge.
> Editors that are currently not able to output NFC will be
> changed to do so, validating and fixup tools (even Web-based)
> will emerge if there is a need, and so on. It can be done
> one-by-one, and has a higher guarantee for success than
> trying to convince every HTML, CSS, XML,... tool in the
> market to do late matching.
> 

If one cannot assume early uniform normalization, then each specification will need to consider the potential ramifications of normalization. It isn't the case that each specification "must" mandate normalization. It might be enough to point out to users that normalization is required if one wishes to use certain features in a certain way. But if normalization is not happening, then users must deal with the results. When I18N started considering this issue, back in 1998, de-normalized data was mostly theoretical. Now we have evidence of languages for which it is not at all a theoretical problem.

But I am also not convinced that requiring normalization in certain contexts is a bad thing. Normalizing identifiers during a matching operation will produce the "do what I mean" behavior that users would expect. It may be maximally inconvenient for browser vendors (as it appears to be), but it may not be the "wrong" thing for us to be recommending. That is especially true in the case of Selectors, where different document types and access methods interact: you can call a Selector from JavaScript or from a CSS file, both of which might be maximally removed from the original HTML author's "normalization intent". Easiest is not always best.
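A rough sketch of what I mean by matching-time normalization, again in Python (the function names are mine, purely for illustration, not from Selectors or any implementation): both identifiers are normalized only at the point of comparison, and nothing in the stylesheet or document is rewritten.

    import unicodedata

    def nfc(s):
        return unicodedata.normalize("NFC", s)

    def idents_match(selector_ident, document_ident):
        # Normalize both sides only when comparing; the stored content
        # on either side is left untouched.
        return nfc(selector_ident) == nfc(document_ident)

    # A class name typed precomposed in the CSS still matches one typed
    # decomposed in the HTML:
    print(idents_match("\u00E9l\u00E9ment", "e\u0301le\u0301ment"))   # True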

Still, to be completely honest, I expect to lose this "normalization battle". I don't think, any more than you do, that we can convince HTML, CSS, XML, and so on to mandate any kind of normalization anywhere at this late date. The affected speaker populations aren't aware of the issue(s) and are small relative to the overall population of affected users. Effectively, too much software and too much content has been written already. Unless Hixie declares his undying love of NFC tomorrow, I just don't see it happening.

Other automatic normalization options are not that enticing either. I think that normalizing documents during the parse step would actually be the most disruptive: there are real reasons to use de-normalized text in one's content, and this would effectively destroy applications that rely on de-normalized sequences. Since such sequences are legal and considered to be different by the Ur-standard... there's a problem!
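For illustration of why (Python again, details mine, not anyone's proposal): NFC legitimately rewrites perfectly legal sequences, which is exactly what a parse-time pass would do to content that depends on them.

    import unicodedata

    angstrom_sign = "\u212B"   # ANGSTROM SIGN, legal on its own
    decomposed = "A\u030A"     # "A" + COMBINING RING ABOVE, also legal

    print("%04X" % ord(unicodedata.normalize("NFC", angstrom_sign)))   # 00C5
    print("%04X" % ord(unicodedata.normalize("NFC", decomposed)))      # 00C5
    # Both collapse to U+00C5; anything downstream that needed to see the
    # original code points no longer can.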

I completely agree that users (sometimes/often/occasionally) know best about what character sequence to use for their data. I have no problem with the user-agent normalizing text for some operation entirely internal to it (such as matching the DOM tree), especially if that normalization-like behavior is somehow mandated, but I have some problem with the user-agent re-writing my content in a way that exposes the changes effectively everywhere. This, of course, assumes that normalization is "legal but not required".

With EUN merely a "recommendation" and late normalization "too costly", we are left with no solution except Best Practices and the occasional convenience method or function. This isn't necessarily wrong (or right). It just is. If you look at my personal contributions to this thread, you'll see that I have consistently insisted that decisions about normalization should be made with eyes open and not by default. To that end, I think that discussing CSS Selectors is a very useful thing to do.

Finally, I take issue with the statement "[w]here it's relevant, the right tools will emerge". This is demonstrably false. It has not happened to date and, with the growth of ever greater and more elaborate Web technologies, is unlikely to occur spontaneously. It's hard enough to get the tools to be standards compliant, after all.

So:

- either EUN happens
- or affected specs mandate normalization in specific cases ("late normalization")
- or normalization is the responsibility of users (health warnings for relevant places in specs)

What's it going to be?

Received on Sunday, 8 February 2009 22:37:05 UTC