- From: Jim Melton <jim.melton@oracle.com>
- Date: Mon, 24 Nov 2003 16:27:29 -0700
- To: public-qt-comments@w3.org
- Message-Id: <6.0.0.22.2.20031124144723.0681f670@gmstimap.oraclecorp.com>
Gentlepeople,

In [1], François Yergeau (in response to Ashok Malhotra's comment copied in his response) observed:

   The Serialization document does indeed mention *performing* normalization on output (during serialization), but comment [83] also requests *checking* normalization on input.

The "comment [83]" to which François' observation refers is found in [2]. That comment (a formal comment by the W3C's Internationalization WG on the Functions & Operators Last Call WD) reads:

   [83] 7.4.11 normalize-unicode: Maybe not as a function, but in any case somehow, normalization checking on input and normalization on output should be available in both XQuery and XSLT, on full XML constructs (with the relevant definitions form XML 1.1)

The issue being addressed has been under discussion for a considerable time. Consider the following text taken from [3], which was transmitted to the Internationalization (I18n) WG on 2001-02-21:

   c) While early normalization has advantages for applications and for end users, the XML Query WG has concerns with the costs (both runtime overhead and development costs) associated with mandating it. We also believe that it would be appropriate for the Character Model to permit conforming implementations to provide a mechanism by which applications can explicitly request normalization in situations where those applications know it will be needed.

Subsequent discussions between the Query WG and the I18n WG demonstrated that there are seriously different views of the problem and the requirements. Some individuals felt very strongly that WWW interoperability absolutely required that every processor follow the rule that every text string must be normalized (using NFC) when it is created and that no processor normalize any string that it is processing.
Other individuals strongly believed that there is an enormous legacy of data that is not normalized and thus certain processors (if not all) cannot depend on receiving data that is already normalized; such processors must (according to this belief) be allowed to perform normalization on such data.

Another W3C private communication [4] from an active member of the I18n WG included the observation:

   Assuming further that most data happens to be normalized and further, that most comparisons don't need to look at all the data, I'm not convinced that a comparison that performs a parallel quick check that the portion of the data actually used is normalized can't be made to execute much faster than if the whole data had been normalized

This observation was eventually characterized as "a processor can check whether data is normalized rather cheaply, and thus normalize only that data that is not normalized".

Further communication from the I18n WG to the Query WG in [5] included the following paragraphs:

   We arrived at prohibition against text normalization on the receiver side mainly by considering situations where an XML document is sent, and where we have to make sure that all recipients interpret it the same way.

   But it seemed to us that you might be thinking about a database, and that you think it would make a lot of sense to offer text normalization for data that enters the database, not the least to make sure that it is normalized when it leaves the database (as the character model requires if we, quite reasonably, consider the database a producer in that case).

Indeed, the use of XQuery in a database context was one basis for the Query WG's position that early normalization was not reliable and thus that processors should be allowed to normalize when required.

The subsequent version of the Character Model [6] continued the prohibition of consumer processors performing normalization.
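
The characterization quoted above ("check whether data is normalized rather cheaply, and thus normalize only that data that is not normalized") can be illustrated with a minimal sketch. It is not taken from any of the cited messages and is not XQuery; it assumes Python 3.8 or later and its standard `unicodedata` module, and the helper name `ensure_nfc` is hypothetical.

```python
import unicodedata


def ensure_nfc(text: str) -> str:
    """Return text in NFC, normalizing only when the check fails.

    Hypothetical helper for illustration; the name is not from the thread.
    """
    # Check first: already-normalized input is returned untouched.
    if unicodedata.is_normalized("NFC", text):
        return text
    # Only input that fails the check pays for a full normalization pass.
    return unicodedata.normalize("NFC", text)


composed = "caf\u00e9"      # precomposed e-acute (already NFC)
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent

assert ensure_nfc(composed) is composed      # no new string is built
assert ensure_nfc(decomposed) == composed    # normalized on demand
```

An explicit call of this kind corresponds to normalization that an application requests, as opposed to normalization that a consuming processor applies implicitly to everything it reads.
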
The Query WG continued to oppose that prohibition, in no small part because it appeared to require those consuming processors that encountered non-normalized data to reject the data completely. That, the Query WG felt, was counter to users' expectations both of data access and of the Web.

The Query WG also opposed a requirement in [6] that every individual character string operation performed by a processor produce normalized results; the opposition was partly on the grounds of performance (constant normalizing can be rather CPU-intensive), but also on the grounds that internal behavior of a processor is not subject to external imposition of arbitrary rules (that is, it's none of their business).

As a result of the Query WG's additional comments, the I18n WG's official response to the concerns about a prohibition of consuming processors performing normalization was that the comment and proposed solution (no such prohibition) were accepted for the next draft of [6]. In particular, the I18n WG accepted the Query WG's provision of a function that allows applications to explicitly request normalization at any time, including normalization of a suspect document when it is first encountered.

Additional telephone discussion between a leading member of the I18n WG and a member of the XSL WG resulted in the I18n WG confirming that the next edition of [6] would relax the prohibitions against consuming applications normalizing internal data. They also agreed that certain processors, such as XSLT, required the ability to *produce* non-normalized data. (That last item does not necessarily apply to XQuery or to XPath 2.0.)

The most recent version of [6], found in [7], does indeed relax most of those prohibitions. Unfortunately, it still contains the prohibition against a consuming application actually normalizing data.
I believe that the Query WG is likely to interpret this prohibition as a prohibition against a consuming processor implicitly normalizing data, but not against explicit, user-requested normalization. That might be a little bit sneaky (or downright cynical), but it's the only way that I, and the Query WG, can see forward into rationalizing both the Query WG's own requirements and the I18n WG's efforts to make the Web truly interoperable.

SUMMARY: It is highly unlikely that the Query WG will modify XQuery to mandate that all input it receives be NFC-normalized, but the Query WG will provide a function by which applications (queries) can explicitly force normalization of input documents.

I hope that this very lengthy history explains why the specifications are presently as they are and why there is no significant likelihood that this situation will be changed.

Hope this helps,
   Jim

[1] http://lists.w3.org/Archives/Public/public-qt-comments/2003Oct/0026.html
[2] http://lists.w3.org/Archives/Public/public-qt-comments/2003Jul/0106.html
[3] http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2001Feb/0247.html (an internal, and thus private, W3C communication)
[4] http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2001Feb/0411.html (an internal, and thus private, W3C communication)
[5] http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2002Jan/0342.html (an internal, and thus private, W3C communication)
[6] http://www.w3.org/TR/2002/WD-charmod-20020430/
[7] http://www.w3.org/TR/2003/WD-charmod-20030822/

========================================================================
Jim Melton --- Editor of ISO/IEC 9075-* (SQL)     Phone: +1.801.942.0144
Oracle Corporation        Oracle Email: mailto:jim.melton@oracle.com
1930 Viscounti Drive      Standards email: mailto:jim.melton@acm.org
Sandy, UT 84093-1063      Personal email: mailto:jim@melton.name
USA                       Fax : +1.801.942.3345
========================================================================
=  Facts are facts.  However, any opinions expressed are the opinions  =
=  only of myself and may or may not reflect the opinions of anybody   =
=  else with whom I may or may not have discussed the issues at hand.  =
========================================================================
Received on Monday, 24 November 2003 18:25:24 UTC