RE: Review on Polyglot Markup Draft from Eliot Graff on 2011-01-26 (public-html@w3.org from January 2011)

From: Eliot Graff <eliotgra@microsoft.com>
Date: Wed, 26 Jan 2011 23:59:08 +0000
To: Lachlan Hunt <lachlan.hunt@lachy.id.au>, public-html <public-html@w3.org>
Message-ID: <CE3A5BFD1228D84A8D9C158EEC195FD525CC8367@TK5EX14MBXW603.wingroup.windeploy.ntde>

In going through old mail to ensure that I haven't missed anything, I found that I neglected to rectify some of the issues raised by this mail.

The Editor's Draft of the polyglot spec of 26 January has incorporated fixes for the comments about Void/Empty elements and about attribute values. I believe everything else was already incorporated, as appropriate.

Thanks for the feedback and your patience.

Eliot

]]
Hi,
This is my review of the current Polyglot Markup draft.

My first problem is that it purports to be a normative document with normative requirements and references. It should instead be an informative note describing the requirements that are derived from the intersection of HTML and XHTML requirements as defined in HTML5. I hope the intention is for this draft to eventually be published as a WG note and that it is not on the Rec track. (I was referred to bug 9969 in IRC for this issue, and so I will document my rationale for this more fully there later.)

*Character Encoding*

The draft states:

"When polyglot markup uses UTF-16, it should include the BOM
indicating UTF-16LE or UTF-16BE."

I realise that text is copied from an e-mail I wrote myself on the topic a while ago, but the description is slightly misleading with regard to what UTF-16LE and UTF-16BE are, and should be rephrased. I suggest the following:

When polyglot markup uses UTF-16, the Byte Order Mark (BOM) must be
included. The BOM is used to indicate whether the encoding is
big-endian or little-endian.

(You could also omit the second sentence from that, as it may not be necessary to provide that bit of trivia to readers.)
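
As a rough illustration of the BOM point (not text proposed for the draft), the following Python sketch reports which byte order a leading UTF-16 BOM indicates. The function name utf16_byte_order is made up for this example, and the check is deliberately naive: a UTF-32LE BOM also begins with the bytes FF FE, which this sketch does not try to distinguish.

```python
import codecs


def utf16_byte_order(data: bytes):
    """Return the byte order signalled by a leading UTF-16 BOM, or None if absent."""
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little-endian"
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big-endian"
    return None


# Python's "utf-16" codec writes a BOM; "utf-16-le" and "utf-16-be" do not.
with_bom = "<!DOCTYPE html>".encode("utf-16")
without_bom = "<!DOCTYPE html>".encode("utf-16-le")

print(utf16_byte_order(with_bom))
print(utf16_byte_order(without_bom))
```

The second call returns None, which is the case the suggested wording rules out: UTF-16 polyglot markup without a BOM.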

"In addition, polyglot markup need not include the meta charset
declaration, because the parser would have to read UTF-16 in order
to parse it by definition."

This too should be updated to state that, at least per the current spec, including a meta charset declaration that specifies UTF-16 (or any other non-ASCII-compatible encoding) is forbidden.

"Use both the XML Declaration and meta tag to specify the appropriate
character encoding."

This is wrong. The XML declaration cannot be used. This requirement contradicts the previous section in the draft where it is correctly noted that "Processing Instructions and the XML Declaration are both forbidden in polyglot markup."

Remove the incorrect advice from this section, and state that only UTF-8 or UTF-16 may be used. Technically you could also say that other encodings can be used if declared at the protocol level (Content-Type metadata), but such advice, if included, should be accompanied by a strong warning to authors to avoid alternative encodings.

*The DOCTYPE*

I suggest you provide an example illustrating the about:legacy-compat DOCTYPE.
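
One possible shape for such an example, sketched here in Python rather than proposed as draft text: the snippet feeds a small sample document using the about:legacy-compat DOCTYPE to an XML parser and reads back the DOCTYPE name and system identifier. The sample markup and the helper name read_doctype are assumptions made for this illustration.

```python
import xml.parsers.expat

# Illustrative sample document (an assumption, not taken from the draft).
POLYGLOT_SAMPLE = (
    '<!DOCTYPE html SYSTEM "about:legacy-compat">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Example</title></head>"
    "<body><p>Hello</p></body></html>"
)


def read_doctype(markup: str):
    """Parse markup as XML and return (name, system_id, public_id) of its DOCTYPE."""
    found = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found["doctype"] = (name, system_id, public_id)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(markup, True)  # raises ExpatError if the markup is not well-formed
    return found.get("doctype")


print(read_doctype(POLYGLOT_SAMPLE))
```

The parse succeeding shows only that this DOCTYPE is well-formed XML; whether it is also acceptable to an HTML parser is a separate matter for the HTML5 spec.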

The list of rules for the DOCTYPE syntax should state that it must conform to the rules for XML DOCTYPEs.

"Polyglot markup may use any other XHTML document type declaration
with a referenced DTD,..."

This is incorrect. The only XHTML DOCTYPEs permitted for use in HTML5 content are those listed as obsolete but permitted. This includes XHTML 1.0 Strict and XHTML 1.1.

The use of any other DOCTYPE is not permitted in polyglot HTML5, because no other XHTML DOCTYPEs are considered conforming in HTML5. Such DOCTYPEs can be used in XHTML-only documents, where there are no restrictions on the permitted DOCTYPEs. But such documents are not to be considered conforming polyglot documents.

"However, note that by using a document type declaration that
references a DTD, the document is required to follow the rules of
the DTD. The rules of the DTD may or may not be compatible with
polyglot markup."

That is not a requirement imposed by the HTML5 specification. The point of permitting the limited set of obsolete DOCTYPEs is to assist with the transition period, so that new HTML5 features can be incorporated into existing pages, and still claim conformance with HTML5. The requirements of their respective obsolete specs are not relevant to an HTML5 conformance claim.

*Namespaces*

"... The prefix must be declared on an SVG or MathML element by using
an attribute in the xlink namespace or on any of its SVG or MathML
ancestors."

That statement does not make sense. What does it mean to declare the prefix "by using an attribute in the xlink namespace"? I believe the statement is just trying to state that the prefix must be declared before xlink:href can be used.
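
To make the intended requirement concrete, here is a small Python sketch (an illustration only) showing that an XML parser rejects xlink:href when the xlink prefix has not been declared, and accepts it once the prefix is declared on the element or an ancestor. The sample SVG fragments and the helper name is_well_formed are invented for this example.

```python
import xml.etree.ElementTree as ET

SVG_NS = 'xmlns="http://www.w3.org/2000/svg"'
XLINK_NS = 'xmlns:xlink="http://www.w3.org/1999/xlink"'


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as namespace-well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# The xlink prefix is used without ever being declared.
undeclared = f'<svg {SVG_NS}><a xlink:href="#t"><text id="t">x</text></a></svg>'

# The xlink prefix is declared on an ancestor (the svg element).
declared = (
    f'<svg {SVG_NS} {XLINK_NS}>'
    '<a xlink:href="#t"><text id="t">x</text></a></svg>'
)

print(is_well_formed(undeclared))
print(is_well_formed(declared))
```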

*Case Sensitivity*

Element Names:
"Polyglot markup uses the correct case for element names."

Please refer to this as the "canonical case". This applies to the Attribute Names section too.

Attribute Values:

This section lists a set of attributes whose values are supposedly case-sensitive and required to be lowercase, which is not true. The list itself appears to be derived from the requirements of case insensitivity of attribute selectors in the spec, as applied to HTML elements in HTML documents.

In HTML5, that list is specifically written as user agent requirements for selector matching. You cannot directly derive document authoring requirements from this list. By attempting to do so, the draft imposes some requirements on authors that do not exist in the spec.

For the purpose of selector matching, attribute values in XML are all treated case-sensitively (except where noted in the user agent style sheet). But for the purpose of deriving semantics, most of the listed attributes are defined to have ASCII case-insensitive values.

The only exception is the type attribute on ol elements, which is always treated case-sensitively. This is not unique to either HTML or XHTML, and the attribute is non-conforming anyway, so it is not relevant for polyglot documents.

I recommend you modify the section to note the case sensitivity of all attribute values for the purpose of selector matching, and recommend but not require the use of lowercase values for all attributes whose values are enumerated values, MIME types, language tags, charsets, booleans, media queries, or keywords.

These are the conforming attributes that have case-insensitive values:

* accept
* accept-charset
* charset
* checked
* defer
* dir
* direction
* disabled
* enctype
* hreflang
* http-equiv
* lang
* media
* method
* multiple
* readonly
* rel (for values that don't contain a colon)
* scope
* selected
* shape
* target (keywords only; browsing context names are case-sensitive)
* type on a, link, object, script, style
* type on input

All the rest of the attributes listed in this section of the current draft are non-conforming.
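
The difference between the two kinds of comparison discussed in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not an implementation of either spec: the helper names are invented, and selector_match_in_xml models only the plain case-sensitive string comparison described above for XML.

```python
import string

# Translation table that lowercases only A-Z, leaving non-ASCII characters alone.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_lower(value: str) -> str:
    """Lowercase the ASCII letters in value; other characters are unchanged."""
    return value.translate(_ASCII_LOWER)


def same_keyword(a: str, b: str) -> bool:
    """ASCII case-insensitive comparison, as used when deriving semantics."""
    return ascii_lower(a) == ascii_lower(b)


def selector_match_in_xml(attr_value: str, selector_value: str) -> bool:
    """Case-sensitive comparison, as for attribute selectors applied to XML."""
    return attr_value == selector_value


# method="POST" means the same as method="post" semantically ...
print(same_keyword("POST", "post"))
# ... but [method="post"] would not match it under case-sensitive comparison.
print(selector_match_in_xml("POST", "post"))
```

Writing such values in lowercase makes the two comparisons agree, which is the motivation for recommending, rather than requiring, lowercase.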

*Empty Elements*

The HTML5 specification refers to these as void elements in order to
distinguish them from elements that happen to have no content. Please
refer to void elements instead of empty elements here too.

"The alternative syntax <br></br> allowed by XML gives uncertain
results in many existing user agents."

This document should not concern itself with the uncertainty of legacy
browser behaviour. If anything, it should instead note how HTML5
requires </br> to be handled and state that its use is forbidden.

--
Lachlan Hunt - Opera Software
http://lachy.id.au/

http://www.opera.com/

[[

Received on Wednesday, 26 January 2011 23:59:48 UTC