- From: Jim Melton <jim.melton@oracle.com>
- Date: Mon, 24 Nov 2003 16:27:29 -0700
- To: public-qt-comments@w3.org
- Message-Id: <6.0.0.22.2.20031124144723.0681f670@gmstimap.oraclecorp.com>
Gentlepeople,
In [1], François Yergeau (in response to Ashok Malhotra's comment copied in
his response) observed:
The Serialization document does indeed mention *performing* normalization
on output (during serialization), but comment [83] also requests *checking*
normalization on input.
The "comment [83]" to which François' observation refers is found in
[2]. That comment (a formal comment by the W3C's Internationalization WG
on the Functions & Operators Last Call WD) reads:
[83] 7.4.11 normalize-unicode: Maybe not as a function, but in any case
somehow, normalization checking on input and normalization on output should
be available in both XQuery and XSLT, on full XML constructs (with the
relevant definitions form XML 1.1)
The issue being addressed has been under discussion for a considerable
time. Consider the following text taken from [3], which was transmitted to
the Internationalization (I18n) WG on 2001-02-21:
c) While early normalization has advantages for applications and for end
users, the XML Query WG has concerns with the costs (both runtime overhead
and development costs) associated with mandating it. We also believe that
it would be appropriate for the Character Model to permit conforming
implementations to provide a mechanism by which applications can explicitly
request normalization in situations where those applications know it will
be needed.
Subsequent discussions between the Query WG and the I18n WG demonstrated
that there are seriously different views of the problem and the
requirements. Some individuals felt very strongly that WWW
interoperability absolutely required that every processor follow the rule
that every text string must be normalized (using NFC) when it is created
and that no processor normalize any string that it is processing. Other
individuals strongly believed that there is an enormous legacy of data that
is not normalized and thus certain processors (if not all) cannot depend on
receiving data that is already normalized; such processors must (according
to this belief) be allowed to perform normalization on such data.
Another W3C private communication [4] from an active member of the I18n WG
included the observation:
Assuming further that most data happens to be normalized and further, that
most comparisons don't need to look at all the data, I'm not convinced that
a comparison that performs a parallel quick check that the portion of the
data actually used is normalized can't be made to execute much faster than
if the whole data had been normalized
This observation was eventually characterized as "a processor can check
whether data is normalized rather cheaply, and thus normalize only that
data that is not normalized".
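
As a rough illustration of that "check first, normalize only when needed" idea, here is a minimal Python sketch. It is not taken from any of the cited messages or specifications. The helper name ensure_nfc is hypothetical, and Python's standard unicodedata module (is_normalized requires Python 3.8 or later) merely stands in for whatever check a real processor might use.

```python
import unicodedata


def ensure_nfc(text: str) -> str:
    """Return text in NFC, normalizing only if the check says it is needed.

    Hypothetical helper for illustration; not part of any W3C specification.
    """
    # Check first: data that is already normalized is returned untouched.
    if unicodedata.is_normalized("NFC", text):
        return text
    # Only data that fails the check pays the cost of normalization.
    return unicodedata.normalize("NFC", text)


# Two spellings of the same name: precomposed c-cedilla versus
# a plain 'c' followed by a combining cedilla (U+0327).
composed = "Fran\u00e7ois"
decomposed = "Franc\u0327ois"

# Without normalization the two strings compare as different.
assert composed != decomposed

# After the check-then-normalize step they compare as equal.
assert ensure_nfc(decomposed) == composed

# Already-normalized input comes back as the very same object.
assert ensure_nfc(composed) is composed
```

The point of the sketch is only that, under the assumption that most data is already normalized, the common path does no rewriting of the data at all.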
Further communication from the I18n WG to the Query WG in [5] included the
following paragraphs:
We arrived at prohibition against text normalization on the receiver side
mainly by considering situations where an XML document is sent, and where
we have to make sure that all recipients interpret it the same way.
But it seemed to us that you might be thinking about a database, and that
you think it would make a lot of sense to offer text normalization for data
that enters the database, not the least to make sure that it is normalized
when it leaves the database (as the character model requires if we, quite
reasonably, consider the database a producer in that case).
Indeed, the use of XQuery in a database context was one basis for the Query
WG's position that early normalization was not reliable and thus that
processors should be allowed to normalize when required.
The subsequent version of the Character Model [6] continued the prohibition
of consumer processors performing normalization. The Query WG continued to
oppose that prohibition, in no small part because it appeared to require
consuming processors that encountered non-normalized data to reject that
data outright. That, the Query WG felt, was counter to users'
expectations both of data access and of the Web. The Query WG also opposed
a requirement in [6] that every individual character string operation
performed by a processor produce normalized results; the opposition was
partly on the grounds of performance (constant normalizing can be rather
CPU-intensive), but also on the grounds that internal behavior of a
processor is not subject to external imposition of arbitrary rules (that
is, it's none of their business).
As a result of the Query WG's additional comments, the I18n WG's official
response to the concerns about a prohibition of consuming processors
performing normalization was that the comment and proposed solution (no
such prohibition) were accepted for the next draft of [6]. In particular,
the I18n WG accepted the Query WG's provision of a function that allows
applications to explicitly request normalization at any time, including
normalization of a suspect document when it is first encountered.
Additional telephone discussion between a leading member of the I18n WG and
a member of the XSL WG resulted in the I18n WG confirming that the next
edition of [6] would relax the prohibitions against consuming applications
normalizing internal data. They also agreed that certain processors, such
as XSLT, required the ability to *produce* non-normalized data. (That last
item does not necessarily apply to XQuery or to XPath 2.0.)
The most recent version of [6], found in [7], does indeed relax most of
those prohibitions. Unfortunately, it still contains the prohibition
against a consuming application actually normalizing data. I believe that
the Query WG is likely to interpret this prohibition as a prohibition
against a consuming processor implicitly normalizing data, but not against
explicit, user-requested normalization. That might be a little bit sneaky
(or downright cynical), but it's the only way that I, and the Query WG, can
see to reconcile the Query WG's own requirements with the I18n WG's
efforts to make the Web truly interoperable.
SUMMARY: It is highly unlikely that the Query WG will modify XQuery to
mandate that all input it receives be NFC-normalized. Instead, the Query WG
will provide a function by which applications (queries) can explicitly
force normalization of input documents.
I hope that this very lengthy history explains why the specifications are
presently as they are and why there is no significant likelihood that this
situation will be changed.
Hope this helps,
Jim
[1] http://lists.w3.org/Archives/Public/public-qt-comments/2003Oct/0026.html
[2] http://lists.w3.org/Archives/Public/public-qt-comments/2003Jul/0106.html
[3] http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2001Feb/0247.html
(an internal, and thus private, W3C communication)
[4] http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2001Feb/0411.html
(an internal, and thus private, W3C communication)
[5] http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2002Jan/0342.html
(an internal, and thus private, W3C communication)
[6] http://www.w3.org/TR/2002/WD-charmod-20020430/
[7] http://www.w3.org/TR/2003/WD-charmod-20030822/
========================================================================
Jim Melton --- Editor of ISO/IEC 9075-* (SQL) Phone: +1.801.942.0144
Oracle Corporation Oracle Email: mailto:jim.melton@oracle.com
1930 Viscounti Drive Standards email: mailto:jim.melton@acm.org
Sandy, UT 84093-1063 Personal email: mailto:jim@melton.name
USA Fax : +1.801.942.3345
========================================================================
= Facts are facts. However, any opinions expressed are the opinions =
= only of myself and may or may not reflect the opinions of anybody =
= else with whom I may or may not have discussed the issues at hand. =
========================================================================
Received on Monday, 24 November 2003 18:25:24 UTC