Re: Polyglot markup and authors from Leif Halvard Silli on 2013-02-19 (www-tag@w3.org from February 2013)

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Tue, 19 Feb 2013 19:10:59 +0100
To: Henri Sivonen <hsivonen@iki.fi>
Cc: Alex Russell <slightlyoff@google.com>, Mukul Gandhi <gandhi.mukul@gmail.com>, Jirka Kosek <jirka@kosek.cz>, public-html WG <public-html@w3.org>, Paul Cotton <Paul.Cotton@microsoft.com>, Maciej Stachowiak <mjs@apple.com>, "www-tag@w3.org List" <www-tag@w3.org>, Sam Ruby <rubys@intertwingly.net>, "Michael[tm] Smith" <mike@w3.org>
Message-ID: <20130219191059535278.f7de8e31@xn--mlform-iua.no>

Henri Sivonen, Tue, 19 Feb 2013 12:32:15 +0200:
> On Mon, Feb 18, 2013 at 2:44 AM, Leif Halvard Silli wrote:

>> If a non-well-formed HTML document had to be converted to XHTML
>> before being processed, then why not choose to convert to polyglot
>> xhtml?
> 
> Because HTML to XHTML conversion can be automated without significant
> data loss (if you consider mapping form feed to another space
> character insignificant and consider the munging of some identifiers
> that have no defined meaning in HTML insignificant) and because you
> don't need to convert to polyglot--only XHTML--to process the doc as
> XHTML.

If the XHTML is used only as a temporary intermediate format, then of
course it might not be very important to consider DOM equivalence; it
might be enough to consider semantic equivalence. For example, such
usage would guarantee that an <?xml version="1.0" encoding="koi8-r"?>
declaration in the temporary format would be deleted when the document
was converted back to HTML.

But what about undeclared character entities, @lang vs xml:lang, and
things like that? How do XML tools deal with them? And what about
encoding (see below), more generally?
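
To make the entity and @lang questions concrete, here is a minimal
sketch. It assumes Python's standard xml.etree.ElementTree stands in
for "XML tools" (a non-validating parser; other tools may behave
differently), and the helper name parses_as_xml is only illustrative.

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"


def parses_as_xml(markup: str) -> bool:
    """Return True if a non-validating XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# &nbsp; is predefined in HTML, but undeclared in XML without a DTD.
named = "<p>a&nbsp;b</p>"
# The numeric character reference needs no declaration in either syntax.
numeric = "<p>a&#160;b</p>"

print(parses_as_xml(named))
print(parses_as_xml(numeric))

# To an XML tool, lang and xml:lang are two unrelated attributes:
# one in no namespace, one in the XML namespace.
el = ET.fromstring('<p lang="ru" xml:lang="ru">text</p>')
print(sorted(el.attrib))
```

In this sketch the named reference is rejected and the numeric one is
accepted, and the parser reports lang and xml:lang under two different
attribute names, so keeping them in sync is left to the author or the
converter.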

But having to first convert to XHTML, and then convert back to HTML,
can also be a burden. If you are the owner of those documents, so
that you can decide their encoding and everything, then choosing
Polyglot Markup would mean that you did not need to use XHTML as an
intermediate format only.

> HTML to polyglot conversion cannot be similarly automated,
> because e.g. ampersands in inline scripts are both common and
> significant.

1. That is perhaps a reason to change Polyglot Markup: today, CDATA in
<style> and <script> is forbidden; instead, there could be specific
rules for how to use CDATA. [1] I lean towards that. Since we have
moved away from HTML4/XHTML1 to HTML5/XHTML, it should be relatively
simple to define the Polyglot Markup rules for CDATA in style and
script.

2. Does Polyglot Markup have to be an all-or-nothing question? Why not
say that automatic conversion should convert documents to Polyglot
Markup as often as possible, but not always? For instance, for a
KOI8-R encoded document, it would be impossible to convert it to
Polyglot Markup without touching the encoding.

3. That said: if the Polyglot Markup format were only a temporary
format, then perhaps it would not need to be a problem that the
temporary format was UTF-8?

[1] http://lists.w3.org/Archives/Public/public-html/2013Feb/0173
-- 
leif halvard silli

Received on Tuesday, 19 February 2013 18:11:36 UTC