RE: [Serial] I18N WG last call comments from Michael Kay on 2004-02-15 (public-qt-comments@w3.org from February 2004)

From: Michael Kay <mhk@mhk.me.uk>
Date: Sun, 15 Feb 2004 23:15:52 -0000
To: "'Martin Duerst'" <duerst@w3.org>, <public-qt-comments@w3.org>
Cc: <w3c-i18n-ig@w3.org>, <w3c-dom-wg@w3.org>
Message-ID: <001001c3f419$a84cc8b0$6401a8c0@pcukmka>
Some personal replies to some of the comments (for which, thanks):
> 
> General last call comments, i18n-related (disposition needed):
> 
> [4] This only defines serialization into bytes. In some contexts
>    (e.g. Databases, in-program,...), serialization into a stream
>    of characters is also important. The spec should specify how
>    this is done.

I agree.
> 
> [5] Section 2, point 3: "each separated by a single space":
>    Inserting a space may not be the right thing, in particular for
>    Chinese, Japanese, Thai,... which don't have spaces between words.
>    This has to be checked very carefully.

This isn't trying to achieve linguistic separation; it is trying to
achieve separation of tokens that meets the rules defined in XML Schema.
XML Schema allows any sequence of whitespace characters between the
items in a list; we mandate a single space character because that's the
simplest whitespace sequence.
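
As an illustration only (not text from the spec), here is a minimal Python
sketch of that idea. The function names are hypothetical; the whitespace set
is the XML one (space, tab, CR, LF).

```python
import re


def serialize_list(items):
    """Join the string values of the items with exactly one space."""
    return " ".join(str(item) for item in items)


def parse_list(text):
    """Split on any run of XML whitespace, as a reader of a list type would."""
    return re.findall(r"[^ \t\r\n]+", text)


tokens = ["10", "20", "30"]
serialized = serialize_list(tokens)

# A reader that tolerates arbitrary whitespace recovers the same tokens
# from the single-space form and from a messier one.
from_single_space = parse_list(serialized)
from_mixed_space = parse_list("10 \t20\r\n30")
```

The single space is purely a token separator here; nothing in the sketch
depends on the language of the content.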
> 
> [6] Section 3, 'encoding': Given that this is already required for
>    the XML output method, we think it's highly desirable to make
>    the requirement for support for UTF-8 and UTF-16 general
>    (including text).

I can't think of any reason not to make this change.
> 
> [7] Section 3, 'encoding': Here or for each individual output method,
>    something should be said about the BOM. We think it should be
>    the following:
>    - XML/XHTML: UTF-16: required; UTF-8: may be used.
>    - HTML/text: UTF-16: recommended; UTF-8: may be used.

I agree. In Saxon, I've added an extension attribute to control whether
a BOM should be emitted, and I think it would be a good idea to make
this a standard feature. The default should be yes for UTF-16, no for
UTF-8.
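
A rough Python sketch of the proposed defaults follows. The function and its
byte_order_mark argument are hypothetical names chosen for this example, not
Saxon's extension attribute. Big-endian is picked for UTF-16 only to keep the
output predictable.

```python
import codecs

# Hypothetical table: encoding -> (codec without BOM, BOM bytes, BOM by default)
_BOM_RULES = {
    "utf-8": ("utf-8", codecs.BOM_UTF8, False),
    "utf-16": ("utf-16-be", codecs.BOM_UTF16_BE, True),
}


def encode_with_bom(text, encoding="utf-8", byte_order_mark=None):
    """Encode text, emitting a BOM by default for UTF-16 but not for UTF-8."""
    codec, bom, default = _BOM_RULES[encoding.lower()]
    emit = default if byte_order_mark is None else byte_order_mark
    data = text.encode(codec)
    return bom + data if emit else data


utf16_default = encode_with_bom("a", "UTF-16")
utf8_default = encode_with_bom("a", "UTF-8")
utf8_forced = encode_with_bom("a", "UTF-8", byte_order_mark=True)
utf16_suppressed = encode_with_bom("a", "UTF-16", byte_order_mark=False)
```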
> 
> [8] Section 3, 'encoding': This should say that for UTF-16,
>    endianness is implementation-dependent (or implementation-defined).

Agreed.
> 
> [9] Section 3, 'encoding': "If this parameter is not specified, and
>    the output method does not specify any additional requirements,
>    the encoding used is implementation defined."
>    This should be more specific. In the absence of an 'encoding'
>    parameter, information e.g. given to an implementation via an
>    option, and specific information for a particular 'host language'
>    (e.g. other than XQuery or XSLT), there should be a default of
>    UTF-8.

Off-hand, I don't see any objection to this except that it might give
some vendors a backwards compatibility problem.
> 
> [12] Section 3, 'escape-uri-attributes' (and other places in this spec):
>    RFC 2396, section 2.4.1, only specifies how to escape a string of
>    bytes in an URI, and cannot directly be applied to a string of
>    (Unicode) characters. In accordance with the IRI draft and many
>    other W3C specifications, this must be specified to use UTF-8
>    first and then use RFC 2396, section 2.4.1 (%-escaping).

Agreed.
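
To make the two steps concrete, here is a small Python sketch (the helper
name is made up for this example). It encodes a character as UTF-8 and then
%-escapes each byte, and compares the result with the standard library's
urllib.parse.quote.

```python
from urllib.parse import quote


def escape_step_by_step(char):
    """Encode as UTF-8 first, then %-escape every resulting byte."""
    utf8_bytes = char.encode("utf-8")
    return "".join(f"%{byte:02X}" for byte in utf8_bytes)


e_acute = escape_step_by_step("\u00e9")  # U+00E9 -> bytes C3 A9
euro = escape_step_by_step("\u20ac")     # U+20AC -> bytes E2 82 AC

# quote() performs the same two steps when told to use UTF-8.
library_utf8 = quote("\u00e9", encoding="utf-8")

# With a different byte encoding the same character escapes differently,
# which is why the choice of encoding has to be specified.
library_latin1 = quote("\u00e9", encoding="latin-1")
```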
> 
> [11] Section 3, 'include-content-type': Why is this parameter needed?
>    It seems that it may be better to always include a <meta> element.
>    Please remove the parameter or tell us when/why it's necessary to
>    not have a <meta> element

This parameter has been requested by users a number of times, but the
situations that justify it are difficult to describe concisely. The
simplest case is where the user wants to output the meta element "by
hand", to give greater control. The other cases I've seen are where the
encoding isn't known until after subsequent stages in the processing
pipeline.
> 
> [12] The description of 'media-type' is confusing. Does it change
>    something in the output, or only in the way the output is labelled?
>    Does it affect the <meta>, if output? Can it affect other things,
>    e.g. a Content-Type header in HTTP? This should be clarified.

You're not the only one who's confused. It's often used by
transformation servlets to set the HTTP headers, but as far as the
serializer itself is concerned, it's documentary.
> 
> [13] Section 3, 'normalize-unicode': Using Normalization Form C is
>    the right thing, but XML 1.1, in accordance with the Character
>    Model, defines some additional start conditions in some cases.
>    How are these guaranteed (e.g. by adding an initial space if
>    necessary)? If there is no such guarantee, there should at least
>    be a warning, but a guarantee is highly preferable.

Your input on this is welcome!
> 
> [14] Section 3, four phases of serialization: Character expansion
>    comes before Encoding, but encoding depends on character
>    expansion (using numeric character references for characters
>    that don't exist in a certain encoding). This has to be
>    sorted out very carefully and explained in detail, ideally
>    with examples. There's also an interaction between mapping and
>    normalization.  If there's a mapping combining grave->&#x300;,
>    normalization must be aware that &#x300; is not an ASCII string!

You are probably right that we need to analyze and explain the
interactions between the different options better than we do at the
moment.
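
The ordering problem can be sketched in Python as below. This is an
illustration under assumptions, not the spec's algorithm: the function names
are hypothetical, and Python's "xmlcharrefreplace" error handler stands in
for the serializer's fallback to numeric character references.

```python
import unicodedata
from xml.sax.saxutils import escape


def serialize_text(text, encoding):
    """Escape markup first, then encode with character-reference fallback."""
    expanded = escape(text)  # "&" -> "&amp;", "<" -> "&lt;", ">" -> "&gt;"
    # Characters the encoding cannot represent become decimal references.
    return expanded.encode(encoding, "xmlcharrefreplace")


def serialize_text_wrong_order(text, encoding):
    """Same steps in the opposite order: the references get escaped too."""
    referenced = text.encode(encoding, "xmlcharrefreplace").decode(encoding)
    return escape(referenced).encode(encoding)


sample = "AT&T \u20ac"  # U+20AC EURO SIGN is not representable in US-ASCII
right = serialize_text(sample, "ascii")
wrong = serialize_text_wrong_order(sample, "ascii")

# Normalization sees a character reference as plain ASCII text,
# not as the combining mark it stands for.
composed = unicodedata.normalize("NFC", "a\u0300")
untouched = unicodedata.normalize("NFC", "a&#x300;")
```

In the wrong order the ampersand of the generated reference is escaped a
second time, so the reader sees literal "&#8364;" text instead of the euro
sign.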
> 
> [15] Section 4, "To anticipate the proposed changes to end-of-line
>    handling in XML 1.1, implementations may also output the characters
>    x85 and x2028 as character references. This will not affect the way
>    they are interpreted by an XML 1.0 parser.": XML 1.1 is now a REC,
>    so this is no longer anticipated. See
>    http://www.w3.org/TR/2004/REC-xml11-20040204/#sec-line-ends

Yes. Now that XML 1.1 and Namespaces 1.1 are at Rec status, I think the
WGs need to take a fresh top-level look at our policy towards them;
serialization is just one aspect of this.
> 
> [16] Section 4.2 (XML output method, encoding): "If no encoding parameter
>    is specified, then the processor must use either UTF-8 or UTF-16.":
>    It may be desirable to further narrow this to UTF-8 for higher
>    predictability. On the other hand, this should not say
>    "If no encoding parameter is specified", but "If no encoding
>    is specified (either with an encoding parameter or externally)"
>    to allow e.g. specification of encoding with an option.

On the first point: yes, perhaps.

On the second, the serializer is driven by a set of parameters. I think
that by the time the serializer is invoked, the parameter values have
been fully computed, regardless of where they came from, so the
serialization spec does not need to discuss different ways of supplying
the parameters.
> 
> [17] Section 4.2 (XML output method, encoding): "When outputting a newline
>    character in the data model, the implementation is free to represent
>    it using any character sequence that will be normalized to a newline
>    character by an XML parser,...": This should probably say that
>    for interoperability, it is better to avoid x85 and x2028.

I don't see a specific need to say that: if you're generating XML 1.0
then you need to avoid these characters and if you're generating XML 1.1
then you don't. This seems to be covered by the statement as written.
> 
> [18] Section 4.5 (XML output method, omit-xml-declaration): "The
>    omit-xml-declaration parameter must be ignored if the standalone
>    parameter is present, or if the encoding parameter specifies a
>    value other than UTF-8 or UTF-16.": This disallows producing
>    XML other than UTF-8 or UTF-16 without an xml declaration even
>    though this is legal e.g. if served over HTTP with a corresponding
>    charset parameter. We are not sure this is intended, and we
>    are not sure this is a good thing. On the other hand,
>    omit-xml-declaration must also be ignored if version is not 1.0.

This rule overriding omit-xml-declaration has proved controversial with
some users, usually because they want to output fragments of XML that
they can concatenate into a single file. We should review it. On the
other hand, users do complain if the serializer produces output that an
XML parser then rejects.
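
For reference, the rule as quoted in comment [18] can be written down as a
small Python predicate. This is a sketch of the quoted draft wording only;
the function and argument names are hypothetical.

```python
def declaration_is_omitted(omit_xml_declaration, standalone=None,
                           encoding="UTF-8"):
    """Return True only if the XML declaration would actually be left out."""
    if standalone is not None:
        # The standalone parameter is present: the request is ignored.
        return False
    if encoding.upper() not in ("UTF-8", "UTF-16"):
        # Any other encoding: the request is ignored.
        return False
    return omit_xml_declaration


fragment_utf8 = declaration_is_omitted(True, encoding="utf-8")
fragment_latin1 = declaration_is_omitted(True, encoding="ISO-8859-1")
with_standalone = declaration_is_omitted(True, standalone="yes")
```

The ISO-8859-1 case is the one users concatenating fragments tend to trip
over: they ask for no declaration and get one anyway.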
> 
> [19] 6.4 HTML Output Method: Writing Character Data: "When outputting
>    a sequence of whitespace characters in the data model, within an
>    element where whitespace is treated normally, (but not in elements
>    such as pre and textarea) the html output method may represent it
>    using any character sequence that will be treated as whitespace
>    by an HTML user agent.": @@@ We need to check whether this (which
>    allows replacement of whitespace including linebreaks by whitespace
>    not including linebreaks and vice-versa) is okay for Chinese,
>    Japanese, Thai,... (languages without spaces between words).
>    This has to be checked extremely carefully.

I think it's better if we don't try to define the detailed rules here,
but just state the constraint: you can replace one whitespace sequence
by another if user agents treat them as equivalent. If we try to be more
precise than this, we will get it wrong.
> 
> [20] 6.4 HTML Output Method: Writing Character Data: "Certain characters,
>    specifically the control characters #x7F-#x9F, are legal in XML but
>    not in HTML. ... The processor may signal the error, but is not
>    required to do so.": Please change this to require the processor
>    to produce an error.

I worry that we will get many complaints from users who are misusing
these codepoints if we do this. Their code will stop working, and it may
be quite difficult for them to fix it. (Though it's a good use case for
character maps...)
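
A small Python sketch of both halves follows: detecting the codepoints in
question, and repairing them with a character-map-style substitution. It
assumes, purely for illustration, that the misuse is Windows-1252 quotation
marks stored at #x93 and #x94. The names are hypothetical.

```python
def find_html_illegal_controls(text):
    """List (index, codepoint) pairs for characters in #x7F-#x9F."""
    return [
        (index, f"#x{ord(char):02X}")
        for index, char in enumerate(text)
        if 0x7F <= ord(char) <= 0x9F
    ]


# Assumed repair table: Windows-1252 curly quotes mapped to real Unicode.
CHARACTER_MAP = {0x93: "\u201c", 0x94: "\u201d"}

suspect = "He said \x93hello\x94."
problems = find_html_illegal_controls(suspect)
repaired = suspect.translate(CHARACTER_MAP)
remaining = find_html_illegal_controls(repaired)
```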
> 
> [21] Section 8: There should be a reference to XSLT to show examples
>    of use of character maps.

I leave this to the editor!
> 
> [22] There should be some warning about denormalization when using
>    charmaps.

I agree.
> 
> 
> 
> General last call comments, not i18n-related:
> 
> [23] Section 4: "The base URIs of nodes in the two trees may be different."
>    Does this mean that base URIs are not serialized? This should be
>    checked or at least explained.

Yes: the base URI is typically supplied at the time a tree is built by a
parser; it is not normally explicit in the content of the tree.
> 
> [24] Cases of creation of non-wellformed XML where the processor is not
>    required to signal an error: It would be good to have an option to
>    request well-formedness checking even if Character Maps are used.

Perhaps.
> 
> [25] 7, Text Output Method: "The media-type parameter is applicable for
>    the text output method.": What does that mean? How is it applied?
>
It means: go and read the general (method-independent) description of
this parameter (which, in this case, is not very enlightening...).

> [32] Section 5 and Section 6: Note starting: "This escaping is deliberately
>    confined to non-ASCII characters,": There are certain ASCII characters
>    that are not allowed in URIs. They should be escaped.

The decision here is very deliberate, as the text says. Note that
appendix B.2.1 of the HTML 4.0 specification also refers to %HH escaping
only in connection with non-ASCII characters.

Although characters such as spaces are not allowed in URIs, if you
escape them in URIs that are interpreted client-side, such as
javascript: URIs, the URI stops working in most browsers.  

Also, you can't escape an id attribute that acts as the target of a
link, because % is not valid in an ID attribute. In practice (whatever
the spec says) if you escape the URI fragment identifier of a same-page
URI reference but don't escape the corresponding ID attribute, the
browser doesn't match them up. In fact, the evidence appears to be that
browsers don't unescape URIs at all, they leave this to be done at the
server. Escaping non-ASCII characters, as we currently specify, appears
to work for fragment identifiers referring to a different page, but not
for same-page references. It's a mess, which is one reason why we now
provide the option to switch off automatic escaping of URIs and allow
the user to do it themselves using the escape-uri() function.
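
To show what "confined to non-ASCII characters" means in practice, here is a
minimal Python sketch. The function name is hypothetical and this is not the
spec's definition: it %-escapes the UTF-8 bytes of non-ASCII characters and
leaves every ASCII character, including spaces, alone.

```python
from urllib.parse import quote


def escape_non_ascii(uri):
    """%-escape only non-ASCII characters, via their UTF-8 bytes."""
    parts = []
    for char in uri:
        if ord(char) < 0x80:
            parts.append(char)  # ASCII, even if not legal in a URI: untouched
        else:
            parts.extend(f"%{byte:02X}" for byte in char.encode("utf-8"))
    return "".join(parts)


script_uri = "javascript:alert('caf\u00e9 ok')"
conservative = escape_non_ascii(script_uri)

# For contrast, a general-purpose escaper also rewrites the space.
aggressive = quote("caf\u00e9 ok")
```

The javascript: URI keeps its spaces and quotes, which is the behaviour the
reply above is defending.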

Regards,

Michael Kay
Received on Sunday, 15 February 2004 18:15:13 UTC