- From: Martin Duerst <duerst@w3.org>
- Date: Sun, 15 Feb 2004 12:37:30 -0500
- To: public-qt-comments@w3.org
- Cc: w3c-i18n-ig@w3.org, w3c-dom-wg@w3.org
Dear XML Query WG and XSL WG,

Below please find the I18N WG's comments on your last call document "XSLT 2.0 and XQuery 1.0 Serialization" (http://www.w3.org/TR/2003/WD-xslt-xquery-serialization-20031112/).

Please note the following:
- Please address all replies to these comments to the I18N IG mailing list (w3c-i18n-ig@w3.org), not just to me.
- Our comments are numbered in square brackets [nn].

We look forward to further discussion with you.

[this mail is copied to the DOM WG to tell them what we are telling you about UTF-16 and endianness, which they should adopt for the Document Object Model (DOM) Level 3 Load and Save Specification]

Overall positive comments (no disposition needed):

[1] Addition of the XHTML output method is a good idea.

[2] Moving this material into the Data Model spec would be okay.

[3] 6.4 HTML Output Method: Writing Character Data: "Entity references and character references should be used only where the character is not present in the selected encoding, or where the visual representation of the character is unclear (as with &nbsp;, for example)." This is very good!

General last call comments, i18n-related (disposition needed):

[4] This only defines serialization into bytes. In some contexts (e.g. databases, in-program, ...), serialization into a stream of characters is also important. The spec should specify how this is done.

[5] Section 2, point 3: "each separated by a single space": Inserting a space may not be the right thing, in particular for Chinese, Japanese, Thai, ..., which don't have spaces between words. This has to be checked very carefully.

[6] Section 3, 'encoding': Given that this is already required for the XML output method, we think it's highly desirable to make the requirement for support for UTF-8 and UTF-16 general (including text).

[7] Section 3, 'encoding': Here or for each individual output method, something should be said about the BOM. We think it should be the following:
- XML/XHTML: UTF-16: required; UTF-8: may be used.
- HTML/text: UTF-16: recommended; UTF-8: may be used.

[8] Section 3, 'encoding': This should say that for UTF-16, endianness is implementation-dependent (or implementation-defined).

[9] Section 3, 'encoding': "If this parameter is not specified, and the output method does not specify any additional requirements, the encoding used is implementation defined." This should be more specific. In the absence of an 'encoding' parameter, of information given to an implementation e.g. via an option, and of specific information for a particular 'host language' (e.g. other than XQuery or XSLT), there should be a default of UTF-8.

[10] Section 3, 'escape-uri-attributes' (and other places in this spec): RFC 2396, section 2.4.1, only specifies how to escape a string of bytes in a URI, and cannot directly be applied to a string of (Unicode) characters. In accordance with the IRI draft and many other W3C specifications, this must be specified to use UTF-8 first and then use RFC 2396, section 2.4.1 (%-escaping).

[11] Section 3, 'include-content-type': Why is this parameter needed? It seems that it may be better to always include a <meta> element. Please remove the parameter or tell us when/why it's necessary to not have a <meta> element.

[12] The description of 'media-type' is confusing. Does it change something in the output, or only in the way the output is labelled? Does it affect the <meta>, if output? Can it affect other things, e.g. a Content-Type header in HTTP? This should be clarified.

[13] Section 3, 'normalize-unicode': Using Normalization Form C is the right thing, but XML 1.1, in accordance with the Character Model, defines some additional start conditions in some cases. How are these guaranteed (e.g. by adding an initial space if necessary)? If there is no such guarantee, there should at least be a warning, but a guarantee is highly preferable.
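
To illustrate the 'escape-uri-attributes' comment above, here is a minimal sketch of the two-step procedure it asks for: encode each character as UTF-8 first, then %-escape the resulting bytes. The function name escape_non_ascii is hypothetical, and the sketch assumes that only non-ASCII characters are escaped; it is not taken from the Serialization draft.

```python
def escape_non_ascii(value: str) -> str:
    """Percent-escape non-ASCII characters via their UTF-8 bytes.

    Hypothetical helper for illustration only: ASCII characters are
    passed through unchanged, which is an assumption of this sketch.
    """
    out = []
    for ch in value:
        if ord(ch) < 128:
            # ASCII: left as-is in this sketch.
            out.append(ch)
        else:
            # Step 1: character -> UTF-8 bytes.
            # Step 2: each byte -> %HH (RFC 2396 style escaping).
            out.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
    return "".join(out)


escaped_latin = escape_non_ascii("résumé.html")
escaped_cjk = escape_non_ascii("日本")
```

Without the UTF-8 step the result would depend on whatever byte encoding an implementation happened to pick, which is the ambiguity the comment points at.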

[14] Section 3, four phases of serialization: Character expansion comes before Encoding, but encoding depends on character expansion (using numeric character references for characters that don't exist in a certain encoding). This has to be sorted out very carefully and explained in detail, ideally with examples. There's also an interaction between mapping and normalization. If there's a mapping combining grave -> &#x300;, normalization must be aware that &#x300; is not an ASCII string!

[15] Section 4, "To anticipate the proposed changes to end-of-line handling in XML 1.1, implementations may also output the characters x85 and x2028 as character references. This will not affect the way they are interpreted by an XML 1.0 parser.": XML 1.1 is now a REC, so this is no longer anticipated. See http://www.w3.org/TR/2004/REC-xml11-20040204/#sec-line-ends

[16] Section 4.2 (XML output method, encoding): "If no encoding parameter is specified, then the processor must use either UTF-8 or UTF-16.": It may be desirable to further narrow this to UTF-8 for higher predictability. On the other hand, this should not say "If no encoding parameter is specified", but "If no encoding is specified (either with an encoding parameter or externally)" to allow e.g. specification of encoding with an option.

[17] Section 4.2 (XML output method, encoding): "When outputting a newline character in the data model, the implementation is free to represent it using any character sequence that will be normalized to a newline character by an XML parser,...": This should probably say that for interoperability, it is better to avoid x85 and x2028.

[18] Section 4.5 (XML output method, omit-xml-declaration): "The omit-xml-declaration parameter must be ignored if the standalone parameter is present, or if the encoding parameter specifies a value other than UTF-8 or UTF-16.": This disallows producing XML other than UTF-8 or UTF-16 without an xml declaration even though this is legal e.g.
if served over HTTP with a corresponding charset parameter. We are not sure this is intended, and we are not sure this is a good thing. On the other hand, omit-xml-declaration must also be ignored if version is not 1.0.

[19] 6.4 HTML Output Method: Writing Character Data: "When outputting a sequence of whitespace characters in the data model, within an element where whitespace is treated normally, (but not in elements such as pre and textarea) the html output method may represent it using any character sequence that will be treated as whitespace by an HTML user agent.": We need to check whether this (which allows replacement of whitespace including linebreaks by whitespace not including linebreaks and vice versa) is okay for Chinese, Japanese, Thai, ... (languages without spaces between words). This has to be checked extremely carefully.

[20] 6.4 HTML Output Method: Writing Character Data: "Certain characters, specifically the control characters #x7F-#x9F, are legal in XML but not in HTML. ... The processor may signal the error, but is not required to do so.": Please change this to require the processor to produce an error.

[21] Section 8: There should be a reference to XSLT to show examples of use of character maps.

[22] There should be some warning about denormalization when using charmaps.

General last call comments, not i18n-related:

[23] Section 4: "The base URIs of nodes in the two trees may be different." Does this mean that base URIs are not serialized? This should be checked or at least explained.

[24] Cases of creation of non-wellformed XML where the processor is not required to signal an error: It would be good to have an option to request well-formedness checking even if Character Maps are used.

[25] 7, Text Output Method: "The media-type parameter is applicable for the text output method.": What does that mean? How is it applied?
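
As an illustration of the denormalization concern raised above for character maps, here is a minimal sketch. It assumes that character mapping is applied after Unicode normalization; the helper apply_character_map and the example map are hypothetical, not taken from the draft.

```python
import unicodedata


def apply_character_map(text: str, char_map: dict) -> str:
    """Replace each mapped character with its replacement string.

    Hypothetical stand-in for a serializer's character-map step.
    """
    return "".join(char_map.get(ch, ch) for ch in text)


# Hypothetical map: spacing grave accent (U+0060) -> combining grave (U+0300).
char_map = {"`": "\u0300"}

source = "a`"  # already in NFC
mapped = apply_character_map(source, char_map)  # "a" followed by U+0300

is_nfc_before = unicodedata.normalize("NFC", source) == source
is_nfc_after = unicodedata.normalize("NFC", mapped) == mapped
renormalized = unicodedata.normalize("NFC", mapped)  # composes to U+00E0
```

Under these assumptions the input is normalized but the mapped output is not, so a serializer that normalizes only before mapping could emit text that is no longer in Normalization Form C.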

Editorial:

[26] Normalization: This term is used for different things:
- Character normalization (Charmod, NFC)
- Normalization as described in section 2 of this document.
- Normalization as described in the formal semantics document.
These should be very clearly distinguished and labeled.

[27] Section 3, 'media-type', says "... the charset parameter of the media type must not be specified explicitly". This should be changed to "... the charset parameter of the media type must not be specified explicitly here." to make clear that this is just a statement about this parameter, not in general.

[28] Section 3, "omit-xml-declaration specifies whether the serialization process is to output an XML declaration. The value must be yes or no If this parameter is not specified, the value is implementation defined." The wording should be improved to make clear which is yes and which is no. (and please add a period after 'no').

[29] Section 4: "Additional nodes may be present in the new tree, and the values of attribute nodes and text nodes in the new tree may be different from those in the original tree, due to the character expansion phase of serialization.": This should clearly state that this applies only to URI escaping and character mapping, and that CDATA sections and escaping of special characters cannot create differences.

[30] 4.8: "If the output method is xml and the value of the version parameter is 1.0, namespace >UN<declaration is not performed, and the undeclare-namespace parameter is ignored."

[31] Section 5 and Section 6: "If the data model includes a head element that has a meta element child, the processor should replace any content attribute of the meta element, or add such an attribute, with the value as described above, rather than output a new meta element." This is written as if there were only one <meta> element. Replacement should only take place if the <meta> element has an http-equiv attribute with value 'Content-Type'.

[32] Section 5 and Section 6: Note starting: "This escaping is deliberately confined to non-ASCII characters,": There are certain ASCII characters that are not allowed in URIs. They should be escaped.

[33] Section 7, freestanding paragraph "The default encoding for the text output method is implementation-defined.": this is a repetition from the previous paragraph and should be removed.

[34] RFC 2376 is obsoleted by RFC 3023.

Regards, Martin.
Received on Sunday, 15 February 2004 12:37:43 UTC