Re-review of the Task Force report from Henri Sivonen on 2011-08-05 (public-html-xml@w3.org from August 2011)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Fri, 05 Aug 2011 14:24:17 +0300
To: public-html-xml@w3.org
Message-ID: <1312543457.2069.33.camel@shuttle>

As promised, here's my re-review of the TF report. My apologies for
being over a week late with this.

-

s/The principle impedement/The principal impediment/

-

"Even this is not a 100% solution as is still possible to encounter HTML
documents that cannot be represented perfectly in XML."

I suggest downplaying the severity of this a bit:
"It is still possible to encounter HTML documents whose document tree
needs to be modified slightly for the document tree to be representable
as XML. For conforming input, the modifications are on the level of
replacing form feeds with spaces."
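
To illustrate the scale of the modification, here is a minimal sketch in
Python using only the standard library. The helper names
(coerce_text_for_xml, is_well_formed) are made up for this example and
are not taken from any spec or from the report. It assumes the only
problematic character in the text is U+000C FORM FEED, which is
whitespace in HTML but is not a legal character in XML 1.0.

```python
import xml.etree.ElementTree as ET

FORM_FEED = "\x0c"  # U+000C: whitespace in HTML, not a legal XML 1.0 character


def coerce_text_for_xml(text: str) -> str:
    """Replace each form feed with a space (hypothetical helper)."""
    return text.replace(FORM_FEED, " ")


def is_well_formed(source: str) -> bool:
    """Report whether the standard-library XML parser accepts the source."""
    try:
        ET.fromstring(source)
        return True
    except ET.ParseError:
        return False


# Text that is fine in an HTML document tree but not representable in XML.
html_text = "line one\x0cline two"

raw_xml = "<p>" + html_text + "</p>"
fixed_xml = "<p>" + coerce_text_for_xml(html_text) + "</p>"

print(is_well_formed(raw_xml))    # the parser rejects the form feed
print(is_well_formed(fixed_xml))  # accepted after the replacement
```

The tree changes by one character of whitespace, which is the sense in
which the modification is slight.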

-

"HTML5 toolchains are widespread and popular."

I think it would be prudent to drop "5" from that sentence at this time.

-

s/conent/content/

-

s/difficlut/difficult/

-

", combining the resulting DOMs through some other process"

I'd drop the above-quoted words, since chances are that environments
that use both HTML and XML can do so just fine without combining them
into one tree at any time.

-

s/intrinsicly/intrinsically/

-

"There are still details of implementation to be considered in the case
where HTML5 is represented with well-formed XML. Is the markup to be
“clipped out” and handed to an HTML5 parser, or is the entire XML DOM
going to be handed to the HTML5 engine?"

Of these two approaches, clipping out the markup and handing it to an
HTML5 parser is clearly not a correct implementation. It would be a
layering violation (access to XML source from the layer that processes
the XML tree and identifies which part is to be clipped out) and
wouldn't work in the general case (most obviously when XHTML element
names are prefixed, but there are other issues as well).

The correct solution is extracting the HTML subtree and passing it to an
HTML engine if there's a tree input interface, or serializing it as HTML
and passing it to an HTML parser if there's only a source text-based
interface. (Or, if the HTML subsystem supports XML parsing but not tree
input, serializing the extracted subtree as XML.)
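
As a rough sketch of the tree-based approach, assuming Python's standard
xml.etree.ElementTree stands in for the XML layer: the function names
and the http://example.org/feed vocabulary below are made up for
illustration. The XHTML subtree is found by namespace rather than by
source text, so the "h:" prefix in the input doesn't matter, and it is
then serialized as HTML for a subsystem that only accepts source text.
Elements and attributes in other namespaces are out of scope here.

```python
import copy
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
XHTML_PREFIX = "{" + XHTML_NS + "}"  # ElementTree spells names as {ns}local


def extract_xhtml_subtrees(root):
    """Return the topmost XHTML-namespace elements under root."""
    found = []

    def walk(element):
        if isinstance(element.tag, str) and element.tag.startswith(XHTML_PREFIX):
            found.append(element)  # do not descend: keep the subtree whole
        else:
            for child in element:
                walk(child)

    walk(root)
    return found


def serialize_as_html(subtree):
    """Serialize a copy of an XHTML subtree as HTML source text."""
    clone = copy.deepcopy(subtree)  # leave the larger XML tree untouched
    clone.tail = None               # text after the subtree is not part of it
    for element in clone.iter():
        if isinstance(element.tag, str) and element.tag.startswith(XHTML_PREFIX):
            # HTML syntax has no namespace prefixes, so keep the local name only.
            element.tag = element.tag[len(XHTML_PREFIX):]
    return ET.tostring(clone, encoding="unicode", method="html")


# Hypothetical compound document with prefixed XHTML element names.
document = (
    '<feed xmlns="http://example.org/feed" '
    'xmlns:h="http://www.w3.org/1999/xhtml">'
    "<entry><h:div><h:p>Hello<h:br/>world</h:p></h:div></entry>"
    "</feed>"
)

root = ET.fromstring(document)
subtrees = extract_xhtml_subtrees(root)
html_source = serialize_as_html(subtrees[0])
print(html_source)
```

Clipping the source text between the div tags would instead hand
"h:div" and "h:p" to the HTML parser, which would not treat them as div
and p elements.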

I suggest rewriting the paragraph like this:
"If the HTML subsystem has an interface that allows document trees to be
passed to it, the XHTML subtree should be extracted from the larger XML
tree and passed to the HTML subsystem. If the HTML subsystem only
accepts HTML source text as its input, the XHTML subtree needs to be
serialized as HTML and passed to the HTML subsystem for parsing using an
HTML parser. In the latter case, some non-conforming constructs may not
round-trip to the same tree shape when serialized as HTML and reparsed
as HTML. Also, conforming trees that have tr elements as children of
table elements will be replaced with a semantically equivalent but
tree-wise different construct in which the tr elements gain a tbody
parent that is a child of the table."

-

"A third solution is to process the compound messages using MIME
multipart/related semantics, perhaps through facilities such as [MTOM]
or [XOP]. This is very much like the escaped markup case where
downstream processing must be sophisticated enough to reconstruct the
authors intent."

This isn't really putting HTML inside an XML document, is it? It looks
to me like it's putting both XML and HTML inside a third format. I
suggest removing this paragraph.

-

My regrets for my unavailability over the next three weeks. Please
forward the report to the TAG on the planned schedule without waiting
for me to agree or disagree with how you chose to handle the above
feedback.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Friday, 5 August 2011 11:24:51 UTC