xquery review from Felix Sasaki on 2005-04-27 (public-i18n-core@w3.org from April to June 2005)

From: Felix Sasaki <fsasaki@w3.org>
Date: Wed, 27 Apr 2005 21:14:09 +0900
To: public-i18n-core@w3.org
Message-ID: <426F8211.2050400@w3.org>
Dear all,

Today I went through some of the reviews from Martin on the xquery-suite, and through the current state of the documents. Below you will find the result. I have done "XPath 2.0", "XSLT 2.0 and XQuery 1.0 Serialization", "XML Syntax for XQuery 1.0 (XQueryX)", "XQuery 1.0: An XML Query Language." It would be great if you could go through the comments and make suggestions for corrections, before we send them to the xquery wg. Tomorrow I will start with the data model and Martin's comments.

Best regards, Felix.


------------------------------------------------------------

Name of specification: XML Path Language (XPath) 2.0

Document: http://www.w3.org/TR/2005/WD-xpath20-20050404/

Main reviewer: Felix Sasaki (fsasaki@w3.org)

------------------------------------------------------------

Comments:

[1] General comment on references to URIs: Throughout this and other
specs, please reference IRIs (RFC 3987,
http://www.ietf.org/rfc/rfc3987.txt) instead of URIs. You often refer to
the XML Schema data type xs:anyURI, e.g. "The URI value is whitespace
normalized according to the rules for the xs:anyURI type in [XML
Schema]." (sec. 2.1.1), but this data type itself is in its latest
version defined in terms of IRI. Referring to IRI directly in your
specification would make things clearer.

[2] Sec. 2.1.2, Definition of an "Implicit time zone"
(http://www.w3.org/TR/xpath20/#eval_context). This has to be removed.
Using implicit conversions between timezoned and non-timezoned dates and
times is way too prone to all kinds of subtle and not so subtle bugs.

[3] Sec. 2.2.3.1, "The operation tree is then normalized ..."
(http://www.w3.org/TR/xpath20/#id-static-analysis). There are many different
normalizations in this series of specifications, like operation tree
normalization in this section, white space normalization (sec. 3.10.2,
4c), Character normalization (Charmod, NFC etc.), normalization as
described in the formal semantics document, sec. 3.2.1, point 3, and
sequence normalization as described in the serialization specification,
sec. 2. These should be very clearly distinguished and labeled. A
section which summarizes the various kinds of normalization would be
helpful.

[4] Sec. 3.5.1 (http://www.w3.org/TR/xpath20/#id-value-comparisons). The
value comparison relies on atomization of the values; if these are
nodes, the atomized value is returned as a typed value. You should make
clear that this is quite different from the comparison of string values.
This difference might be important for some i18n applications. Consider
the following example:

<myEl1>bla<myEl2>&#x160;</myEl2></myEl1>

If a schema declares the type of myEl2 as empty, &#x160; would not be
part of the PSVI, and the result of

$myDoc/myEl1 eq "bla"

would be true; otherwise it would be false.
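The string-value side of this example can be sketched in Python. This is only an illustration: ElementTree performs no schema validation, so the typed value is not modelled here, and the variable names are made up for the sketch.

```python
import xml.etree.ElementTree as ET

# The example document from the comment above; &#x160; is U+0160 (S with caron).
doc = ET.fromstring("<myEl1>bla<myEl2>&#x160;</myEl2></myEl1>")

# The string value of an element concatenates all descendant text nodes.
string_value = "".join(doc.itertext())

# Text directly inside myEl1, ignoring the content of myEl2.
own_text = doc.text

print(repr(string_value))  # includes U+0160
print(repr(own_text))
```

A comparison against "bla" therefore depends on which of the two values is used.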

[5] References: The reference to ISO/IEC 10646 should be updated to the
newest version, i.e. ISO/IEC 10646:2003.

------------------------------------------------------------

Name of specification: XSLT 2.0 and XQuery 1.0 Serialization

Document: http://www.w3.org/TR/2005/WD-xslt-xquery-serialization-20050404/

Main reviewer: Felix Sasaki (fsasaki@w3.org)

------------------------------------------------------------

Comments:

[1] Sec. 2, point 3
(http://www.w3.org/TR/xslt-xquery-serialization/#serdm). "each separated
by a single space": Inserting a space may not be the right thing, in
particular for Chinese, Japanese, Thai, ... which don't have spaces
between words. This has to be checked very carefully.
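A minimal sketch of the concern, assuming a hypothetical helper join_atomic_values that imitates the "each separated by a single space" rule (it is not taken from the specification):

```python
def join_atomic_values(values):
    """Join the string forms of atomic values with a single space."""
    return " ".join(str(v) for v in values)

# Harmless for a language that separates words with spaces.
latin = join_atomic_values(["New", "York"])

# For Japanese, the inserted space is not part of the original text.
japanese = join_atomic_values(["東京", "都"])

print(latin)
print(japanese)
```

In the second case the serialized text contains a space that a reader of the original data would not expect.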

[2] Sec. 3, serialization parameter 'encoding'
(http://www.w3.org/TR/xslt-xquery-serialization/#serparam). Given that
this is already required for the XML output method, we think it's highly
desirable to make the requirement for support for UTF-8 and UTF-16
general (including text output).

[3] Sec. 3, 'encoding'. Here or for each individual output method,
something should be said about the BOM. As for the byte-order-mark
parameter in sec. 3, you say "If the concept of a Byte Order Mark is not
meaningful in connection with the value of the encoding parameter, the
byte-order-mark parameter is ignored." We think in sec. 3 or for each
output method you could elaborate "meaningful" to the following:

- XML/XHTML: UTF-16: BOM required; UTF-8: may be used.

- HTML/text: UTF-16: BOM recommended; UTF-8: may be used.
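To illustrate what "meaningful" amounts to in practice, here is a small Python sketch using the standard codecs. It only shows how the BOM and byte order appear in the output bytes; the helper has_utf16_bom is a name invented for this sketch.

```python
import codecs

text = "a"

# The generic "utf-16" codec writes a BOM followed by native byte order.
with_bom = text.encode("utf-16")

# The explicit-endianness codecs write no BOM.
little_endian = text.encode("utf-16-le")
big_endian = text.encode("utf-16-be")

# "utf-8-sig" prefixes the UTF-8 signature; plain "utf-8" does not.
utf8_with_bom = text.encode("utf-8-sig")

def has_utf16_bom(data):
    """Return True if the byte string starts with either UTF-16 BOM."""
    return data[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

print(has_utf16_bom(with_bom), has_utf16_bom(little_endian))
```

Without a BOM, a consumer of UTF-16 output has to learn the byte order from somewhere else.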

[4] Sec. 3, 'encoding'. The respective sections for the individual
output methods (5.1.2, 6.1.2, 7.4.2, 8.1.2) should say that for
UTF-16, endianness is implementation-dependent (or implementation-defined).

[5] Sec. 3, 'encoding'. The respective sections for the individual output
methods (5.1.2, 6.1.2, 7.4.2, 8.1.2) should say that, in absence of an
'encoding' parameter, there should be a default of UTF-8.

[6] Section 3, 'include-content-type'. Please explain in more detail in
this section or in the sections for XHTML (6.1.13) / HTML (7.4.13) why
this parameter is necessary. It seems that it may be better to always
include a respective <meta> element in XHTML / HTML.

[7] Sec. 4, point 2a
(http://www.w3.org/TR/xslt-xquery-serialization/#serphases). You define
URI-escaping in terms of XLINK. We propose to refer to section 3.1 of
the IRI specification (RFC 3987) instead, because XLINK lacks a
normalization procedure to NFC which might be a necessary step for
mapping non-ASCII characters to ASCII characters.
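A rough sketch of why the NFC step matters, assuming a hypothetical function escape_non_ascii that normalizes to NFC and then percent-encodes the UTF-8 bytes of each non-ASCII character. It is not a complete implementation of the RFC 3987 mapping.

```python
import unicodedata
from urllib.parse import quote

def escape_non_ascii(value):
    """Normalize to NFC, then percent-encode non-ASCII characters as UTF-8."""
    nfc = unicodedata.normalize("NFC", value)
    return "".join(ch if ord(ch) < 128 else quote(ch, safe="") for ch in nfc)

precomposed = "caf\u00e9"   # e with acute as one code point
decomposed = "cafe\u0301"   # e followed by a combining acute accent

print(escape_non_ascii(precomposed))
print(escape_non_ascii(decomposed))
```

With normalization both spellings map to the same escaped form; without it they would yield two different URIs for what users see as the same text.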

[8] Sec. 5.1.2 (XML output method, encoding;
http://www.w3.org/TR/xslt-xquery-serialization/#XML_ENCODING). "When
outputting a newline character in the instance of the data model, the
serializer is free to represent it using any character sequence that
will be normalized to a newline character by an XML parser, unless a
specific mapping for the newline character is provided in a character
map: see 9 Character Maps." This should probably say that for
interoperability, it is better to avoid x85 and x2028. See sec. 2.11 of
XML 1.1 for further information.
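For reference, the line-end handling described in sec. 2.11 of XML 1.1 can be sketched as follows. The function name to_lf is invented for this sketch, and it only covers the translation to #xA that an XML 1.1 processor applies on input.

```python
def to_lf(text):
    """Translate XML 1.1 line-end sequences to a single U+000A."""
    # Two-character sequences are replaced first, so a lone #xD is
    # only handled after #xD #xA and #xD #x85 have been consumed.
    for sequence in ("\r\n", "\r\x85", "\x85", "\u2028", "\r"):
        text = text.replace(sequence, "\n")
    return text

sample = "a\r\nb\x85c\u2028d\re"
print(repr(to_lf(sample)))
```

An XML 1.0 processor only translates the sequences involving #xD, which is why #x85 and #x2028 in serialized output are an interoperability risk.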

[9] Sec. 5.1.5 (XML output method, omit-xml-declaration;
http://www.w3.org/TR/xslt-xquery-serialization/#XML_OMIT-XML-DECLARATION).
The interplay between omit-xml-declaration and the standalone parameter
might disallow producing XML documents which are in an encoding other
than UTF-8 or UTF-16 and have no XML declaration. Nevertheless, this
should be possible, e.g. if XML is served over HTTP with a corresponding
charset parameter. Also, with XML 1.1, the XML declaration is mandatory,
no matter what the values of omit-xml-declaration and standalone are.

[10] Sec. 6.1.12
(http://www.w3.org/TR/xslt-xquery-serialization/#XHTML_ESCAPE-URI-ATTRIBUTES)
and Sec. 7.4.12
(http://www.w3.org/TR/xslt-xquery-serialization/#HTML_ESCAPE-URI-ATTRIBUTES),
Note starting: "This escaping is deliberately confined to non-ASCII
characters ...". There are certain ASCII characters that are not allowed
in URIs, namely "<", ">", '"', space, "{", "}", "|", "\", "^",
and "`". They should be escaped.
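A minimal sketch of the requested behaviour, assuming a hypothetical helper escape_disallowed_ascii that percent-encodes only the ASCII characters listed above and leaves everything else untouched:

```python
from urllib.parse import quote

# The ASCII characters named in this comment.
DISALLOWED = '<>" {}|\\^`'

def escape_disallowed_ascii(value):
    """Percent-encode the listed ASCII characters in an attribute value."""
    return "".join(
        quote(ch, safe="") if ch in DISALLOWED else ch
        for ch in value
    )

print(escape_disallowed_ascii("http://example.org/a b?q={x}"))
```

Reserved characters such as "/", "?" and "=" are kept as they are, so the structure of the URI is not changed.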

[11] Sec. 7.3 (HTML Output Method: Writing Character Data;
http://www.w3.org/TR/xslt-xquery-serialization/#N10FE9). "When
outputting a sequence of whitespace characters in the data model, within
an element where whitespace is treated normally, (but not in elements
such as pre and textarea) the html output method may represent it using
any character sequence that will be treated as whitespace by an HTML
user agent." We need to check whether this (which allows replacement of
whitespace including linebreaks by whitespace not including linebreaks
and vice-versa) is okay for Chinese, Japanese, Thai, ... (languages
without spaces between words). This has to be checked extremely carefully.

[12] Sec. 8.1.13
(http://www.w3.org/TR/xslt-xquery-serialization/#TEXT_INCLUDE-CONTENT-TYPE).
The text should talk about "include-content-type" instead of
"escape-uri-attributes".

------------------------------------------------------------

Name of specification: XML Syntax for XQuery 1.0 (XQueryX)

Document: http://www.w3.org/TR/2005/WD-xqueryx-20050404/

Main reviewer: Felix Sasaki (fsasaki@w3.org)

------------------------------------------------------------

Comments:

[1] Sec. 5
(http://www.w3.org/TR/2005/WD-xqueryx-20050404/#TrivialEmbedding), "If
the XQuery contains characters that are prohibited in XML text (such as
< and &), they must be "escaped" as either character entity references
or character references." It should be made clear what is meant by
"prohibited in XML text", e.g. XML-predefined entities.
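As an illustration of the escaping in question, the following Python sketch embeds a query string in an element and reads it back. The element name query and the sample query are placeholders for this sketch, not names from the XQueryX schema.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# A query containing "<", which cannot appear literally in XML text.
query = "//item[price < 10]"

# escape() replaces "&", "<" and ">" with predefined entity references.
escaped = escape(query)
embedded = "<query>" + escaped + "</query>"

# An XML parser restores the original characters.
round_tripped = ET.fromstring(embedded).text

print(embedded)
```

A character reference such as &#60; would be an equally valid way to write the same character.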

[2] C.2.
(http://www.w3.org/TR/2005/WD-xqueryx-20050404/#xqueryx-mime-registration),
concerning various subsections. Editorial: Please add RFC 3023 in the
reference section (Appendix A).

[3] C.2.1, encoding considerations, editorial. "The considerations as
specified in RFC 3023 [XMLMIME] also hold for 'application/xquery+xml'."
Please add a link to the section in RFC 3023 which deals with these
considerations, i.e. sec. 3.2.

------------------------------------------------------------

Name of specification: XQuery 1.0: An XML Query Language

Document: http://www.w3.org/TR/2005/WD-xquery-20050404/

Main reviewer: Felix Sasaki (fsasaki@w3.org)

------------------------------------------------------------

Comments:

[1] General. How can xml:lang be extracted from data and preserved with
a query? How can this be done without littering all elements with
unnecessary xml:lang attributes? The function fn:lang, defined in the
specification on functions and operators, provides some solution for the
extraction of xml:lang, but not for its generation in the output.
Something like the namespace-alias technique proposed by XSLT 2.0 might
be useful for this purpose, see
http://www.w3.org/TR/2005/WD-xslt20-20050404/#namespace-aliasing

[2] General. There should be more non-US examples. For example, it is
very difficult for somebody not from the US to understand why there are
no Deep Sea Fishermen in Nebraska.

[3] 3.7.1.3 Content (http://www.w3.org/TR/xquery/#id-content):
serializing atomic values by inserting spaces may not be appropriate for
Chinese, Japanese, Thai,..., i.e. languages that don't use spaces
between words. This has to be checked very carefully.

[4] Sec. 3.7.2 (http://www.w3.org/TR/xquery/#id-otherConstructors). Not
requiring CDATA constructs to be serialized as CDATA sections is a good
idea, because it helps dispel the idea that CDATA sections are
semantically significant.

[5] For collations, namespaces, schemas, and so on, the production 141
"URILiteral" (sec. A.1 http://www.w3.org/TR/xquery/#id-grammar) is used,
which refers to a "StringLiteral". "URILiteral" should be changed to
"IRILiteral", and the reference section should contain an entry to the
IRI specification RFC3987. There should also be a clear indication how
XML Base affects collations, namespaces etc.

[6] It is only implementation-defined whether XQuery supports XML 1.0
or XML 1.1
(http://www.w3.org/TR/2005/WD-xquery-20050404/#dt-implementation-defined).
There should be a feature in XQuery that allows users to choose between
these two versions of XML.

[7] C.3 Serialization Parameters
(http://www.w3.org/TR/xquery/#id-xq-serialization-parameters). This
table must be updated with the respective table from the serialization
specification (http://www.w3.org/TR/xslt-xquery-serialization/#serparam).
Received on Wednesday, 27 April 2005 12:14:18 UTC