- From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
- Date: Tue, 19 Feb 2013 19:10:59 +0100
- To: Henri Sivonen <hsivonen@iki.fi>
- Cc: Alex Russell <slightlyoff@google.com>, Mukul Gandhi <gandhi.mukul@gmail.com>, Jirka Kosek <jirka@kosek.cz>, public-html WG <public-html@w3.org>, Paul Cotton <Paul.Cotton@microsoft.com>, Maciej Stachowiak <mjs@apple.com>, "www-tag@w3.org List" <www-tag@w3.org>, Sam Ruby <rubys@intertwingly.net>, "Michael[tm] Smith" <mike@w3.org>
Henri Sivonen, Tue, 19 Feb 2013 12:32:15 +0200:
> On Mon, Feb 18, 2013 at 2:44 AM, Leif Halvard Silli wrote:
>> If a non-well-formed HTML document had to be converted to XHTML
>> before being processed, then why not choose to convert to polyglot
>> xhtml?
>
> Because HTML to XHTML conversion can be automated without significant
> data loss (if you consider mapping form feed to another space
> character insignificant and consider the munging of some identifiers
> that have no defined meaning in HTML insignificant) and because you
> don't need to convert to polyglot--only XHTML--to process the doc as
> XHTML.

If the XHTML is used as a temporary intermediate format only, then of
course it might not be very important to consider DOM equivalence - it
might be enough to consider semantic equivalence. E.g. such a usage
would guarantee that an <?xml version="1.0" encoding="koi8-r"?>
declaration in the temporary format would be deleted when the document
was converted back to HTML. But what about undeclared character
entities and @lang vs xml:lang, and that kind of thing? How do XML
tools deal with them? And what about encoding (see below), more
generally?

But having to first convert to XHTML and then convert back to HTML can
also be a burden. If you are the owner of those documents, so that you
can decide their encoding and everything else, then choosing Polyglot
Markup would mean that you did not need to use XHTML as an
intermediate format only.

> HTML to polyglot conversion cannot be similarly automated,
> because e.g. ampersands in inline scripts are both common and
> significant.

1. That is perhaps a reason to change Polyglot Markup from the current
situation, where CDATA in <style> and <script> is forbidden, to
specific rules for how to use CDATA. [1] I lean towards that. (A sketch
of what such a rule could look like follows below my signature.) Since
we have moved away from HTML4/XHTML1 to HTML5/XHTML, it should be
relatively simple to define the Polyglot Markup rules for CDATA in
style and script.

2. Does Polyglot Markup have to be an all-or-nothing question? Why not
say that automatic conversion should convert documents to Polyglot
Markup as often as possible - but not always? For instance, a KOI8-R
encoded document could not be converted to Polyglot Markup without
touching the encoding.

3. That said: if the Polyglot Markup format were only a temporary
format, then it would perhaps not need to be a problem that the
temporary format was UTF-8?

[1] http://lists.w3.org/Archives/Public/public-html/2013Feb/0173
-- 
leif halvard silli
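
PS, re point 1: a minimal sketch of the kind of CDATA rule I have in
mind - the variable and function names are just made-up placeholders,
and this is not anything the Polyglot Markup draft currently defines:

  <script>//<![CDATA[
  // To an HTML parser the script content is raw text, so the line
  // above is just a JavaScript comment; to an XML parser it opens a
  // CDATA section, so the "&&" and "<" below need no escaping in
  // either serialisation.
  if (a < b && c > d) { doSomething(); }
  //]]></script>

The same pattern should work for <style>, with the CDATA markers
wrapped in CSS comments (/*<![CDATA[*/ ... /*]]>*/) instead of
JavaScript comments.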
Received on Tuesday, 19 February 2013 18:11:38 UTC