[Serial] I18N WG last call comments from Martin Duerst on 2004-02-15 (public-qt-comments@w3.org from February 2004)

From: Martin Duerst <duerst@w3.org>
Date: Sun, 15 Feb 2004 12:37:30 -0500
To: public-qt-comments@w3.org
Cc: w3c-i18n-ig@w3.org, w3c-dom-wg@w3.org
Message-Id: <4.2.0.58.J.20040211164817.042d5210@localhost>

Dear XML Query WG and XSL WG,

Below please find the I18N WG's comments on your last call document
"XSLT 2.0 and XQuery 1.0 Serialization"
(http://www.w3.org/TR/2003/WD-xslt-xquery-serialization-20031112/).

Please note the following:
- Please address all replies to these comments to the I18N IG mailing
   list (w3c-i18n-ig@w3.org), not just to me.
- Our comments are numbered in square brackets [nn].


We look forward to further discussion with you.

[This mail is copied to the DOM WG to tell them what we are
telling you about UTF-16 and endianness, which they should adopt
for the Document Object Model (DOM) Level 3 Load and Save
Specification.]


Overall positive comments (no disposition needed):

[1] Addition of the XHTML output method is a good idea

[2] Moving this material into the Data Model spec would be okay

[3] 6.4 HTML Output Method: Writing Character Data: "Entity references
   and character references should be used only where the character
   is not present in the selected encoding, or where the visual
   representation of the character is unclear (as with &nbsp;,
   for example)."
   This is very good!



General last call comments, i18n-related (disposition needed):

[4] This only defines serialization into bytes. In some contexts
   (e.g. Databases, in-program,...), serialization into a stream
   of characters is also important. The spec should specify how
   this is done.

[5] Section 2, point 3: "each separated by a single space":
   Inserting a space may not be the right thing, in particular for
   Chinese, Japanese, Thai,... which don't have spaces between words.
   This has to be checked very carefully.

[6] Section 3, 'encoding': Given that this is already required for
   the XML output method, we think it's highly desirable to make
   the requirement for support for UTF-8 and UTF-16 general
   (including text).

[7] Section 3, 'encoding': Here or for each individual output method,
   something should be said about the BOM. We think it should be
   the following:
   - XML/XHTML: UTF-16: required; UTF-8: may be used.
   - HTML/text: UTF-16: recommended; UTF-8: may be used.
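
The BOM cases in [7] can be illustrated with a minimal Python sketch. It
only shows what the byte streams look like with and without a BOM; the
sample text is made up, and Python's codec names stand in for whatever an
actual serializer would use.

```python
import codecs

text = "<a>\u00e9</a>"  # made-up sample content

# Python's generic "utf-16" codec writes a BOM and uses the platform's
# byte order, so the endianness is chosen by the implementation.
utf16_native = text.encode("utf-16")

# The explicit-endianness codecs write no BOM; it has to be added by hand.
utf16_be = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
utf16_le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# For UTF-8 the BOM is optional: "utf-8-sig" emits it, "utf-8" does not.
utf8_with_bom = text.encode("utf-8-sig")
utf8_plain = text.encode("utf-8")
```

A reader that sees the BOM can recover the text from either UTF-16 byte
order without external information, which is the reason for asking for it.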

[8] Section 3, 'encoding': This should say that for UTF-16,
   endianness is implementation-dependent (or implementation-defined).

[9] Section 3, 'encoding': "If this parameter is not specified, and
   the output method does not specify any additional requirements,
   the encoding used is implementation defined."
   This should be more specific. In the absence of an 'encoding'
   parameter, information e.g. given to an implementation via an
   option, and specific information for a particular 'host language'
   (e.g. other than XQuery or XSLT), there should be a default of
   UTF-8.

[10] Section 3, 'escape-uri-attributes' (and other places in this spec):
   RFC 2396, section 2.4.1, only specifies how to escape a string of
   bytes in a URI, and cannot directly be applied to a string of
   (Unicode) characters. In accordance with the IRI draft and many
   other W3C specifications, this must be specified to use UTF-8
   first and then use RFC 2396, section 2.4.1 (%-escaping).
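
The two-step procedure asked for here (UTF-8 first, then %-escaping of the
resulting bytes) can be sketched in Python as follows. The function name
and the example URI are illustrative only; ASCII characters are left
untouched in this sketch.

```python
def escape_non_ascii(uri: str) -> str:
    """Percent-escape each non-ASCII character via its UTF-8 bytes."""
    out = []
    for ch in uri:
        if ord(ch) < 128:
            # ASCII characters pass through unchanged in this sketch.
            out.append(ch)
        else:
            # Step 1: convert the character to UTF-8 bytes.
            # Step 2: %-escape every byte (RFC 2396, section 2.4.1).
            out.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
    return "".join(out)

escaped = escape_non_ascii("http://example.org/r\u00e9sum\u00e9?q=\u65e5")
```

Here each U+00E9 becomes %C3%A9 and U+65E5 becomes %E6%97%A5.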

[11] Section 3, 'include-content-type': Why is this parameter needed?
   It seems that it may be better to always include a <meta> element.
   Please remove the parameter or tell us when/why it's necessary to
   not have a <meta> element.

[12] The description of 'media-type' is confusing. Does it change
   something in the output, or only in the way the output is labelled?
   Does it affect the <meta>, if output? Can it affect other things,
   e.g. a Content-Type header in HTTP? This should be clarified.

[13] Section 3, 'normalize-unicode': Using Normalization Form C is
   the right thing, but XML 1.1, in accordance with the Character
   Model, defines some additional start conditions in some cases.
   How are these guaranteed (e.g. by adding an initial space if
   necessary)? If there is no such guarantee, there should at least
   be a warning, but a guarantee is highly preferable.
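
To show why a start condition matters in addition to NFC, here is a rough
Python sketch. It approximates "starts with a composing character" by a
non-zero canonical combining class, which is an assumption of this sketch
and narrower than the Character Model's definition.

```python
import unicodedata

def is_nfc(text: str) -> bool:
    """True if the text is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text

def starts_with_combining_mark(text: str) -> bool:
    """Approximation: first character has a non-zero combining class."""
    return bool(text) and unicodedata.combining(text[0]) != 0

fragment = "\u0301abc"          # begins with COMBINING ACUTE ACCENT
fragment_is_nfc = is_nfc(fragment)
fragment_starts_badly = starts_with_combining_mark(fragment)

# Placed after other content, the fragment interacts with what precedes it.
joined = "e" + fragment
joined_is_nfc = is_nfc(joined)
```

The fragment is in NFC when taken alone, yet the concatenation is not,
which is the situation the additional start conditions are meant to exclude.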

[14] Section 3, four phases of serialization: Character expansion
   comes before Encoding, but encoding depends on character
   expansion (using numeric character references for characters
   that don't exist in a certain encoding). This has to be
   sorted out very carefully and explained in detail, ideally
   with examples. There's also an interaction between mapping and
   normalization.  If there's a mapping combining grave->&#x300;,
   normalization must be aware that &#x300; is not an ASCII string!
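
Both dependencies in [14] can be illustrated with a minimal Python sketch.
It is not the algorithm of the spec; the function name is hypothetical, and
Python's "xmlcharrefreplace" error handler stands in for a serializer's
fallback to numeric character references.

```python
import unicodedata

def expand_for_encoding(text: str, encoding: str) -> bytes:
    """Escape markup characters, then fall back to numeric character
    references for anything the target encoding cannot represent."""
    escaped = text.replace("&", "&amp;").replace("<", "&lt;")
    return escaped.encode(encoding, errors="xmlcharrefreplace")

# Which characters need a reference depends on the encoding chosen.
ascii_bytes = expand_for_encoding("caf\u00e9 \u20ac", "ascii")
latin1_bytes = expand_for_encoding("caf\u00e9 \u20ac", "iso-8859-1")

# Normalizing the real combining character composes it with its base...
composed = unicodedata.normalize("NFC", "a\u0300")
# ...but normalizing the text of a character reference changes nothing,
# because at that point it is just seven ASCII characters.
untouched = unicodedata.normalize("NFC", "a&#x300;")
```

The first half shows that expansion cannot be decided before the encoding
is known; the second half shows why normalization has to look through
character references rather than treat them as plain text.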

[15] Section 4, "To anticipate the proposed changes to end-of-line
   handling in XML 1.1, implementations may also output the characters
   x85 and x2028 as character references. This will not affect the way
   they are interpreted by an XML 1.0 parser.": XML 1.1 is now a REC,
   so this is no longer anticipated. See
   http://www.w3.org/TR/2004/REC-xml11-20040204/#sec-line-ends

[16] Section 4.2 (XML output method, encoding): "If no encoding parameter
   is specified, then the processor must use either UTF-8 or UTF-16.":
   It may be desirable to further narrow this to UTF-8 for higher
   predictability. On the other hand, this should not say
   "If no encoding parameter is specified", but "If no encoding
   is specified (either with an encoding parameter or externally)"
   to allow e.g. specification of encoding with an option.

[17] Section 4.2 (XML output method, encoding): "When outputting a newline
   character in the data model, the implementation is free to represent
   it using any character sequence that will be normalized to a newline
   character by an XML parser,...": This should probably say that
   for interoperability, it is better to avoid x85 and x2028.

[18] Section 4.5 (XML output method, omit-xml-declaration): "The
   omit-xml-declaration parameter must be ignored if the standalone
   parameter is present, or if the encoding parameter specifies a
   value other than UTF-8 or UTF-16.": This disallows producing
   XML other than UTF-8 or UTF-16 without an xml declaration even
   though this is legal e.g. if served over HTTP with a corresponding
   charset parameter. We are not sure this is intended, and we
   are not sure this is a good thing. On the other hand,
   omit-xml-declaration must also be ignored if version is not 1.0.

[19] 6.4 HTML Output Method: Writing Character Data: "When outputting
   a sequence of whitespace characters in the data model, within an
   element where whitespace is treated normally, (but not in elements
   such as pre and textarea) the html output method may represent it
   using any character sequence that will be treated as whitespace
   by an HTML user agent.": We need to check whether this (which
   allows replacement of whitespace including linebreaks by whitespace
   not including linebreaks and vice-versa) is okay for Chinese,
   Japanese, Thai,... (languages without spaces between words).
   This has to be checked extremely carefully.

[20] 6.4 HTML Output Method: Writing Character Data: "Certain characters,
   specifically the control characters #x7F-#x9F, are legal in XML but
   not in HTML. ... The processor may signal the error, but is not
   required to do so.": Please change this to require the processor
   to produce an error.

[21] Section 8: There should be a reference to XSLT to show examples
   of use of character maps.

[22] There should be some warning about denormalization when using
   charmaps.
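
A small Python sketch of the kind of denormalization meant in [22]. The
character map below is hypothetical (it maps the ASCII grave accent to
U+0300), and the map is applied after normalization, as in the phase order
discussed in [14].

```python
import unicodedata

# Hypothetical character map: ASCII grave accent -> COMBINING GRAVE ACCENT.
CHARACTER_MAP = {"`": "\u0300"}

def apply_character_map(text: str) -> str:
    """Replace each mapped character by its replacement string."""
    return "".join(CHARACTER_MAP.get(ch, ch) for ch in text)

def is_nfc(text: str) -> bool:
    """True if the text is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text

source = unicodedata.normalize("NFC", "a`")  # normalized input
mapped = apply_character_map(source)         # "a" followed by U+0300
```

The input is in NFC, but the mapped output is not: NFC would have required
the single character U+00E0 instead of the two-character sequence.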



General last call comments, not i18n-related:

[23] Section 4: "The base URIs of nodes in the two trees may be different."
   Does this mean that base URIs are not serialized? This should be
   checked or at least explained.

[24] Cases of creation of non-wellformed XML where the processor is not
   required to signal an error: It would be good to have an option to
   request well-formedness checking even if Character Maps are used.

[25] 7, Text Output Method: "The media-type parameter is applicable for
   the text output method.": What does that mean? How is it applied?


Editorial:

[26] Normalization: This term is used for different things:
   - Character normalization (Charmod, NFC)
   - Normalization as described in section 2 of this document.
   - Normalization as described in the formal semantics document.
   These should be very clearly distinguished and labeled.

[27] Section 3, 'media-type', says "... the charset parameter of the
   media type must not be specified explicitly". This should be
   changed to "... the charset parameter of the media type must
   not be specified explicitly here." to make clear that this
   is just a statement about this parameter, not in general.

[28] Section 3, "omit-xml-declaration specifies whether the serialization
   process is to output an XML declaration. The value must be yes or no
   If this parameter is not specified, the value is implementation defined."
   The wording should be improved to make clear which is yes and which
   is no. (and please add a period after 'no').

[29] Section 4: "Additional nodes may be present in the new tree, and
   the values of attribute nodes and text nodes in the new tree may be
   different from those in the original tree, due to the character
   expansion phase of serialization.": this should clearly state
   that this applies only to URI escaping and character mapping, and
   that CDATA sections and escaping of special characters cannot
   create differences.

[30] 4.8: "If the output method is xml and the value of the version
   parameter is 1.0, namespace >UN<declaration is not performed,
   and the undeclare-namespace parameter is ignored."

[31] Section 5 and Section 6: "If the data model includes a head element
   that has a meta element child, the processor should replace any
   content attribute of the meta element, or add such an attribute,
   with the value as described above, rather than output a new meta element."
   This is written as if there were only one <meta> element.
   Replacement should only take place if the <meta> element has an
   http-equiv attribute with value 'Content-Type'.
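
The behaviour requested in [31] could look roughly like the following
Python sketch. The function name and the sample markup are hypothetical,
and comparing the http-equiv value case-insensitively is an assumption of
this sketch, not something the draft states.

```python
import xml.etree.ElementTree as ET

def set_content_type(head: ET.Element, value: str) -> None:
    """Update the Content-Type <meta> if one exists, else add a new one."""
    for meta in head.findall("meta"):
        # Only a <meta> carrying http-equiv="Content-Type" is replaced;
        # other <meta> elements (author, keywords, ...) are left alone.
        if meta.get("http-equiv", "").lower() == "content-type":
            meta.set("content", value)
            return
    new_meta = ET.Element(
        "meta", {"http-equiv": "Content-Type", "content": value}
    )
    head.insert(0, new_meta)

head = ET.fromstring(
    '<head><meta name="author" content="A. N. Other"/>'
    '<meta http-equiv="Content-Type" '
    'content="text/html; charset=ISO-8859-1"/></head>'
)
set_content_type(head, "text/html; charset=UTF-8")
```

After the call, the author <meta> keeps its content attribute and only the
Content-Type <meta> carries the new value.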

[32] Section 5 and Section 6: Note starting: "This escaping is deliberately
   confined to non-ASCII characters,": There are certain ASCII characters
   that are not allowed in URIs. They should be escaped.
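
For illustration, a Python sketch that also escapes the ASCII characters
that may not appear literally in a URI (space, angle brackets, braces and
so on). The set of characters kept as-is below is an assumption of this
sketch; in particular "%" is kept so that existing escapes are not escaped
a second time.

```python
from urllib.parse import quote

# Assumed set of characters to leave alone: URI delimiters plus "%".
# Letters, digits and "_.-~" are always left alone by quote().
KEEP = ":/?#[]@!$&'()*+,;=%"

def escape_disallowed(uri: str) -> str:
    """Percent-escape everything outside KEEP and the unreserved set;
    non-ASCII characters are converted to UTF-8 first."""
    return quote(uri, safe=KEEP, encoding="utf-8")

escaped = escape_disallowed("http://example.org/a b<c>.html?x={1}")
```

Here the space becomes %20, the angle brackets %3C and %3E, and the braces
%7B and %7D, while the URI delimiters are preserved.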

[33] Section 7, freestanding paragraph "The default encoding for the text
   output method is implementation-defined.": this is a repetition from
   the previous paragraph and should be removed.

[34] RFC 2376 is obsoleted by RFC 3023.


Regards,    Martin.
Received on Sunday, 15 February 2004 12:37:43 UTC