- From: Jonathan Marsh <jmarsh@microsoft.com>
- Date: Fri, 23 Feb 2001 17:41:25 -0800
- To: <www-i18n-comments@w3.org>
The Character Model Working Draft proposes early uniform normalization of XML documents on the web [1]. This is intended mainly to solve the problem of string identity matching [2]. We believe that the proposal relies on an unenforceable social contract and falls short of a complete solution. These are our concerns:

1) The model proposed is the opposite of the successful model currently in use on the web, which is that the consumer be prepared to accept inputs from producers of unknown quality. For HTML browsers, "bad" data is subjected to whatever fixup is necessary to display something. For XML, "bad" data is rejected by strict well-formedness and possibly validity checks. If we could trust all producers to supply well-formed XML and valid HTML, well-formedness checks and structural fix-ups would be unnecessary. Conversely, why should a consumer trust that producers will universally supply correctly normalized data when the consumer can't trust them to produce well-formed XML or valid HTML?

2) The model requires a level of trust in producers. When that trust is violated for any reason, unnormalized content may result in errors. As a social contract rather than a technical one, early normalization is impossible to enforce.

3) Without enforcement, the problems that early uniform normalization purports to solve are not in fact solved. The assumption is that there is currently a small amount of unnormalized data with the potential to disrupt applications. Early uniform normalization seems designed to keep the amount of unnormalized data small, not to deal with the consequences of such data. In fact it may hinder dealing with unnormalized data, since it precludes certain strategies for coping with it, namely late normalization (see the first sketch following the references).

4) The performance penalty imposed upon producers is severe for high-speed XML generators (speed) and constrained environments (memory). Since the model is unenforceable and doesn't offer a complete solution, there is little incentive for a particular product to incur this penalty.

5) The character model should consider augmenting early uniform normalization with enforcement by consumers. For instance, an XML processor could reject unnormalized input in the same way it rejects well-formedness violations (see the second sketch following the references). Not only does this protect applications from errors to some extent, it also provides the most powerful kind of encouragement for producers to normalize early. Assuming that verifying correct normalization is much cheaper than normalizing itself, this seems like a reasonable compromise on the performance side too. Did the I18N group consider this? A single surgical change to XML 1.0, failing unnormalized documents as not well-formed, would go an incredibly long way toward encouraging proper behavior by producers (including HTML as it evolves into XHTML).

6) We note that using a text editor which outputs NFC-normalized text does not guarantee that W3C-normalized XML or HTML is produced, because of the necessity to rewrite entities, as illustrated in [3] (and in the third sketch following the references). Thus the responsibility for creating W3C-normalized text cannot be placed entirely upon text editors; it requires specific knowledge of normalization constraints, in this case on the part of the individual author. Suggestions that most products already produce NFC-normalized output therefore do not apply to W3C-normalized XML and HTML created with text editors.

- Jonathan Marsh
  Microsoft

[1] http://www.w3.org/TR/charmod/#sec-Normalization
[2] http://www.w3.org/TR/WD-charreq#2.1
[3] http://www.w3.org/TR/charmod/#sec-TextNormalization
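First sketch. The string identity problem, and the late-normalization strategy mentioned in point 3, can be made concrete with a minimal Python example (our illustration, not from the draft; it assumes only the standard unicodedata module):

    import unicodedata

    a = "\u00e9"   # 'e' with acute accent as the single precomposed code point U+00E9
    b = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT

    print(a == b)  # False: the strings render identically but differ in code points

    # Late normalization: the consumer normalizes both sides at comparison
    # time instead of trusting producers to have normalized early.
    print(unicodedata.normalize("NFC", a) ==
          unicodedata.normalize("NFC", b))  # True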
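Second sketch. The consumer-side enforcement suggested in point 5 might look like the following (the function name is hypothetical; unicodedata.is_normalized exists from Python 3.8 onward and applies the Unicode quick-check, which is typically much cheaper than building a normalized copy):

    import unicodedata

    def reject_unnormalized(text: str) -> str:
        # Fail fast on unnormalized input, analogously to rejecting a
        # well-formedness violation, rather than silently repairing it.
        if not unicodedata.is_normalized("NFC", text):  # Python 3.8+
            raise ValueError("document content is not NFC-normalized")
        return text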
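Third sketch. Point 6 can be demonstrated directly. In the example below (ours; standard library only), the source text is pure ASCII, so any NFC-normalizing editor would leave it untouched, yet the character data it denotes is unnormalized once the character reference is expanded:

    import unicodedata
    import xml.etree.ElementTree as ET

    src = "<p>e&#x301;</p>"    # pure ASCII source text, trivially in NFC

    elem = ET.fromstring(src)  # parsing expands &#x301; to U+0301

    print(unicodedata.is_normalized("NFC", src))        # True:  the source is NFC
    print(unicodedata.is_normalized("NFC", elem.text))  # False: 'e' + combining acute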
Received on Friday, 23 February 2001 20:42:04 UTC