Re: I18N last call comments on XQuery/XPath Fun/Op (2nd part) [83] from Jim Melton on 2003-11-24 (public-qt-comments@w3.org from November 2003)

From: Jim Melton <jim.melton@oracle.com>
Date: Mon, 24 Nov 2003 16:27:29 -0700
To: public-qt-comments@w3.org
Message-Id: <6.0.0.22.2.20031124144723.0681f670@gmstimap.oraclecorp.com>
Gentlepeople,

In [1], François Yergeau (in response to Ashok Malhotra's comment copied in 
his response) observed:

    The Serialization document does indeed mention *performing* normalization
    on output (during serialization), but comment [83] also requests *checking*
    normalization on input.

The "comment [83]" to which François' observation refers is found in
[2].  That comment (a formal comment by the W3C's Internationalization WG 
on the Functions & Operators Last Call WD) reads:

    [83] 7.4.11 normalize-unicode: Maybe not as a function, but in any case
    somehow, normalization checking on input and normalization on output should
    be available in both XQuery and XSLT, on full XML constructs (with the
    relevant definitions form XML 1.1)

The issue being addressed has been under discussion for a considerable
time.  Consider the following text taken from [3], which was transmitted to 
the Internationalization (I18n) WG on 2001-02-21:

    c) While early normalization has advantages for applications and for end
    users, the XML Query WG has concerns with the costs (both runtime overhead
    and development costs) associated with mandating it.  We also believe that
    it would be appropriate for the Character Model to permit conforming
    implementations to provide a mechanism by which applications can explicitly
    request normalization in situations where those applications know it will
    be needed.

Subsequent discussions between the Query WG and the I18n WG demonstrated
that there are seriously different views of the problem and the 
requirements.  Some individuals felt very strongly that WWW 
interoperability absolutely required that every processor follow the rule 
that every text string must be normalized (using NFC) when it is created 
and that no processor normalize any string that it is processing.  Other 
individuals strongly believed that there is an enormous legacy of data that 
is not normalized and thus certain processors (if not all) cannot depend on 
receiving data that is already normalized; such processors must (according 
to this belief) be allowed to perform normalization on such data.

Another W3C private communication [4] from an active member of the I18n WG 
included the observation:

    Assuming further that most data happens to be normalized and further, that
    most comparisons don't need to look at all the data, I'm not convinced that
    a comparison that performs a parallel quick check that the portion of the
    data actually used is normalized can't be made to execute much faster than
    if the whole data had been normalized

This observation was eventually characterized as "a processor can check
whether data is normalized rather cheaply, and thus normalize only that 
data that is not normalized".
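
The "check first, normalize only if needed" idea can be illustrated with a
minimal sketch. Python is used here purely for illustration; it is not part
of XQuery, XSLT, or the Character Model. The function name ensure_nfc is
hypothetical, and unicodedata.is_normalized requires Python 3.8 or later.

```python
import unicodedata


def ensure_nfc(text: str) -> str:
    """Return text in NFC, normalizing only when the check fails.

    Hypothetical helper: already-normalized input is returned untouched,
    so the cost of building a new string is paid only for data that
    actually needs it.
    """
    if unicodedata.is_normalized("NFC", text):
        return text
    return unicodedata.normalize("NFC", text)


composed = "\u00e9"      # e-acute as a single code point (already NFC)
decomposed = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

already_ok = ensure_nfc(composed)    # passes the check, returned as-is
repaired = ensure_nfc(decomposed)    # fails the check, gets normalized
```

Whether the check really is cheaper than unconditional normalization depends
on the data and the implementation; the sketch shows only the shape of the
approach, not a performance claim.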

Further communication from the I18n WG to the Query WG in [5] included the 
following paragraphs:

    We arrived at prohibition against text normalization on the receiver side
    mainly by considering situations where an XML document is sent, and where
    we have to make sure that all recipients interpret it the same way.

    But it seemed to us that you might be thinking about a database, and that
    you think it would make a lot of sense to offer text normalization for data
    that enters the database, not the least to make sure that it is normalized
    when it leaves the database (as the character model requires if we, quite
    reasonably, consider the database a producer in that case).

Indeed, the use of XQuery in a database context was one basis for the Query
WG's position that early normalization was not reliable and thus that 
processors should be allowed to normalize when required.

The subsequent version of the Character Model [6] continued the prohibition 
of consumer processors performing normalization.  The Query WG continued to 
oppose that prohibition, in no small part because it appeared to require
consuming processors that encountered non-normalized data to reject the
data completely.  That, the Query WG felt, was counter to users'
expectations both of data access and of the Web.  The Query WG also opposed 
a requirement in [6] that every individual character string operation 
performed by a processor produce normalized results; the opposition was 
partly on the grounds of performance (constant normalizing can be rather 
CPU-intensive), but also on the grounds that internal behavior of a 
processor is not subject to external imposition of arbitrary rules (that 
is, it's none of their business).

As a result of the Query WG's additional comments, the I18n WG's official 
response to the concerns about a prohibition of consuming processors 
performing normalization was that the comment and proposed solution (no
such prohibition) were accepted for the next draft of [6].  In particular,
the I18n WG accepted the Query WG's provision of a function that allows 
applications to explicitly request normalization at any time, including 
normalization of a suspect document when it is first encountered.

Additional telephone discussion between a leading member of the I18n WG and 
a member of the XSL WG resulted in the I18n WG confirming that the next 
edition of [6] would relax the prohibitions against consuming applications 
normalizing internal data.  They also agreed that certain processors, such 
as XSLT, required the ability to *produce* non-normalized data.  (That last 
item does not necessarily apply to XQuery or to XPath 2.0.)

The most recent version of [6], found in [7], does indeed relax most of 
those prohibitions.  Unfortunately, it still contains the prohibition 
against a consuming application actually normalizing data.  I believe that 
the Query WG is likely to interpret this prohibition as a prohibition 
against a consuming processor implicitly normalizing data, but not against 
explicit, user-requested normalization.  That might be a little bit sneaky
(or downright cynical), but it's the only way forward that I, and the Query
WG, can see to reconcile the Query WG's own requirements with the I18n
WG's efforts to make the Web truly interoperable.

SUMMARY: It is highly unlikely that the Query WG will modify XQuery to
mandate that all input it receives be NFC-normalized.  The Query WG will,
however, provide a function by which applications (queries) can explicitly
force normalization of input documents.
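
To make the practical difference concrete, here is a small sketch of
comparing data as received versus comparing after an explicit,
application-requested normalization. Python stands in for a query language
here; the two function names are hypothetical, and unicodedata.normalize
plays roughly the role that the normalize-unicode function of comment [83]
would play in a query.

```python
import unicodedata


def equal_as_received(a: str, b: str) -> bool:
    """Compare code point by code point; nothing is normalized implicitly."""
    return a == b


def equal_after_explicit_nfc(a: str, b: str) -> bool:
    """Compare after the caller has explicitly asked for NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "caf\u00e9"      # e-acute as a single code point
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

same_without_request = equal_as_received(composed, decomposed)
same_with_request = equal_after_explicit_nfc(composed, decomposed)
```

The two strings render identically but compare as different unless the
application asks for normalization, which is the case the explicit function
is meant to cover.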

I hope that this very lengthy history explains why the specifications are 
presently as they are and why there is no significant likelihood that this 
situation will be changed.

Hope this helps,
    Jim

[1] http://lists.w3.org/Archives/Public/public-qt-comments/2003Oct/0026.html

[2] http://lists.w3.org/Archives/Public/public-qt-comments/2003Jul/0106.html

[3] http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2001Feb/0247.html 
(an internal, and thus private, W3C communication)

[4] http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2001Feb/0411.html 
(an internal, and thus private, W3C communication)

[5] http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2002Jan/0342.html 
(an internal, and thus private, W3C communication)

[6] http://www.w3.org/TR/2002/WD-charmod-20020430/

[7] http://www.w3.org/TR/2003/WD-charmod-20030822/


========================================================================
Jim Melton --- Editor of ISO/IEC 9075-* (SQL)     Phone: +1.801.942.0144
Oracle Corporation            Oracle Email: mailto:jim.melton@oracle.com
1930 Viscounti Drive          Standards email: mailto:jim.melton@acm.org
Sandy, UT 84093-1063              Personal email: mailto:jim@melton.name
USA                                                Fax : +1.801.942.3345
========================================================================
=  Facts are facts.  However, any opinions expressed are the opinions  =
=  only of myself and may or may not reflect the opinions of anybody   =
=  else with whom I may or may not have discussed the issues at hand.  =
========================================================================
Received on Monday, 24 November 2003 18:25:24 UTC