Re: HTML/XML TF Report glosses over Polyglot Markup (Was: Statement why the Polyglot doc should be informative) from Robin Berjon on 2012-12-03 (public-html@w3.org from December 2012)

From: Robin Berjon <robin@w3.org>
Date: Mon, 03 Dec 2012 11:35:38 +0100
To: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
CC: Henri Sivonen <hsivonen@iki.fi>, public-html WG <public-html@w3.org>, www-tag@w3.org
Message-ID: <50BC807A.7070108@w3.org>
Hi Leif,

Speaking as a participant in the TF, but not for it.

On 30/11/2012 20:10 , Leif Halvard Silli wrote:
> I did not take part in the HTML/XML Task Force. But I am critical about
> what the report says (very little!) about polyglot markup.  Here are my
> comments on that report, from that angle:
>
> Regarding "2.1 How can an XML toolchain be used to consume HTML?"
>            http://www.w3.org/TR/html-xml-tf-report/#uc01,
>   TF says: In the problem refinement, the TF go astray, by replacing
>            "HTML" with "Web" (how can an XML toolchain be used to
>            consume the Web), quote: "HTML is not guaranteed (or even
>            likely, […] to be well-formed". As soon as you replaced
>            HTML with Web, then Polyglot Markup in reality went out
>            the window. With that problem description, the only role
>            of Polyglot Markup becomes as  *output format* for
>            the bespoke toolchain, but that use, is never discussed.
>   Verdict: W.r.t. Polyglot Markup, section 2.1 mixes up the arguments

I think that the problem statement is very clear here. XML has some 
interesting tools to process content, and given the abstract 
similarities with HTML it would be beneficial to be able to apply them to both.

Case in point: we have a few large HTML datasets at hand which we can 
use to look at how HTML is used in the wild. But for the most part we 
tend to be limited to grepping, to some simple indexing, or to parsing 
them all and applying some ad hoc rules to extract data (which is slow). 
It would be sweet to be able to just load those into a DB and run XQuery 
(or something like it) on them. If that were possible, you'd get very 
fast, high octane analysis with off-the-shelf software (a lot of it open 
source).
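
To make the idea concrete, here is a rough sketch (Python, standard
library only) of getting tag soup into a tree that XPath-style queries
can run over. It is not taken from the TF report. The names SoupToTree
and html_to_tree are made up for illustration, and the recovery rules are
deliberately naive: they are NOT the HTML5 parsing algorithm. A real
pipeline would presumably put a conforming HTML5 parser in front of an
actual XQuery engine.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML void elements: they never take an end tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class SoupToTree(HTMLParser):
    """Hypothetical, very rough tag-soup-to-ElementTree builder."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("root")   # synthetic wrapper element
        self.stack = [self.root]         # currently open elements

    def handle_starttag(self, tag, attrs):
        elem = ET.SubElement(self.stack[-1], tag,
                             {k: (v or "") for k, v in attrs})
        if tag not in VOID:
            self.stack.append(elem)

    def handle_startendtag(self, tag, attrs):
        # <foo/> syntax: add the element without opening it.
        ET.SubElement(self.stack[-1], tag, {k: (v or "") for k, v in attrs})

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element;
        # stray end tags are silently ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):                  # text after a closed child
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:                            # text before any child
            parent.text = (parent.text or "") + data


def html_to_tree(html):
    parser = SoupToTree()
    parser.feed(html)
    parser.close()
    return parser.root


soup = '<p>Unclosed <b>bold<p>Second <a href="/x">link</a><br><img src=a.png>'
tree = html_to_tree(soup)
links = [a.get("href") for a in tree.iter("a")]
images = tree.findall(".//img[@src]")
```

Once the content is a tree, the query side is the easy part; the hard
part is agreeing on how the soup becomes a tree in the first place.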

Saying "polyglot" here just doesn't help: very little real-world content 
uses it. Note that the section clearly looks at polyglot and gives a 
clear reason for not using it in this case.

> Regarding "2.2 How can an HTML toolchain be used to consume XML?"
>            http://www.w3.org/TR/html-xml-tf-report/#uc02
>   TF says: "the most successful approach may be to simply translate
>            the XML to HTML5 before passing it to the HTML5 tool"
>   Verdict: How come this section didn't evaluate Polyglot Markup?

"Processing a real XML document with an HTML5 parser is probably never 
going to be possible with complete fidelity." In general it's not a 
problem you can solve. And polyglot (rightfully IMHO) doesn't even try.
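
One small illustration of the fidelity problem, as a sketch rather than a
proof: XML names are case-sensitive, whereas HTML parsing lowercases tag
names. Python's html.parser is not a conforming HTML5 parser, but it
shares that particular behaviour, so it is used here as a stand-in. The
sample document and the TagNames class are invented for the example.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# <Item> and <item> are two different element types in XML.
XML_DOC = "<Doc><Item>a</Item><item>b</item></Doc>"


class TagNames(HTMLParser):
    """Hypothetical probe that records start-tag names as the parser sees them."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        self.names.append(tag)


probe = TagNames()
probe.feed(XML_DOC)
probe.close()

html_names = probe.names                                   # case folded
xml_names = [e.tag for e in ET.fromstring(XML_DOC).iter()]  # case preserved
```

After the HTML-side pass the distinction between Item and item is gone,
and nothing downstream can recover it.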

> Regarding "2.3 How can islands of HTML be embedded in XML?"
>            http://www.w3.org/TR/html-xml-tf-report/#uc03
>   TF says: EITHER, create HTML as "well-formed XML" = "requirements
>            on the author" OR absolve the author by (having the tool)
>            escaping markup.
>   Verdict: How come you didn't mention having the tool output
>            Polyglot Markup?

It pretty much says either use XHTML (in which case you don't need 
polyglot) or embed the HTML as text (in which case you don't need 
polyglot). Recommending polyglot here would depend too much on the 
specifics of the usage, and in general wouldn't help.

> Regarding "2.4 How can islands of XML be embedded in HTML?"
>            http://www.w3.org/TR/html-xml-tf-report/#uc04
>   TF says: Use <script> as XML container and use JavaScript to make it
>            render in the DOM.
>   Verdict: It seems like Polyglot Markup does not discuss that approach.
>            If the TF document had purported to be an evaluation of
>            Polyglot Markup, you would have discussed it.
>      Also: I don't understand the last sentence: "Note also that
>            polyglot markup is not an aid here as it forbids arbitrary
>            XML content from the document." Does it? It doesn't any
>            more than HTML5 proper does: If you add something that
>            HTML5 doesn't permit, then it isn't HTML5 any more but
>            "extended HTML5". But clearly, it is possible to create
>            "extended polyglot markup" - just apply its principles.

That section's advice is mostly missing a mention of the pitfalls of 
</script> IMHO. Including XML in <script> is definitely *not* something 
that polyglot should recommend since you'd get very different DOMs on 
either side. It's a useful technique when you know you'll be parsed as 
HTML — and therefore clearly outside polyglot.
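
A quick sketch of the "very different DOMs" point, again with Python's
standard library standing in for real browser parsers (an assumption on
my part, and html.parser is not a full HTML5 implementation). The sample
markup and the ScriptProbe class are invented for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

DOC = '<script type="application/xml"><data><item id="1"/></data></script>'


class ScriptProbe(HTMLParser):
    """Hypothetical probe: records the start tags and text the HTML side sees."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


probe = ScriptProbe()
probe.feed(DOC)
probe.close()

# HTML side: the content of <script> is raw text, not markup.
html_tags = probe.tags
html_text = "".join(probe.text)

# XML side: the same bytes yield real child elements and no text.
xml_root = ET.fromstring(DOC)
xml_tags = [e.tag for e in xml_root.iter()]
```

The HTML side ends up with a single script element holding a string; the
XML side ends up with a script element containing a data element
containing an item element. Same source, two trees.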

> Regarding "2.5 How can XML be made more forgiving of errors?"
>            http://www.w3.org/TR/html-xml-tf-report/#uc05
>   TF says: XML5, error handling in XML etc.
>   Verdict: Provided that the goal of the task force (improved
>            "interoperability between HTML and XML") could be
>            helped by making XML fail in the exact way that
>            HTML fails, then why did you not discuss Polyglot
>            Markup as an option here?

Because looking at potential future changes to XML is completely outside 
the scope of polyglot. It's also completely different from polyglot's goals.

>   Verdict: The idea that this HTML parser could produce polyglot markup
> (and no: not in order to pee in the tag soup ocean, but in order to be
> a more useful parser in that tool chain!), is never discussed.

I'm not even sure what it would mean for an HTML parser to produce 
polyglot markup.

> Over all, the report is trapped in some well known dichotomies. And
> Polyglot Markup is not considered in a serious way. The Task Force's
> report is a very thin basis for rescinding the request for robust,
> polyglot markup.

Actually, we considered polyglot seriously. We found polyglot to be 
useful for the uses it was designed for, but not applicable to all cases 
in which XML/HTML interoperability is desirable.

-- 
Robin Berjon - http://berjon.com/ - @robinberjon
Received on Monday, 3 December 2012 10:36:36 UTC