Re: Polyglot Markup Formal Objection Rationale from Leif Halvard Silli on 2012-11-06 (public-html@w3.org from November 2012)

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Tue, 6 Nov 2012 13:30:19 +0100
To: Smylers <Smylers@stripey.com>
Cc: public-html@w3.org
Message-ID: <20121106133019548309.34dabd36@xn--mlform-iua.no>
Smylers, Tue, 6 Nov 2012 10:52:56 +0000:
> Jirka Kosek writes:
>> On 5.11.2012 15:04, Smylers wrote:

Regarding how "UTF-8" fits with the thoughts behind Polyglot Markup:

>>>> For example as both in HTML5 and in XML you have some variety in
>>>> choosing encoding, Polyglot must *normatively* define that only
>>>> allowed encoding is UTF-8.
>>> 
>>> It can do that by reference; it doesn't need to do it explicitly.
>>> Clearly by the definition of polyglot HTML (being the overlap of text/html
>>> and XHTML) a conforming polyglot document needs to use an encoding
>>> which:
>>> 
>>> * Is allowed in conforming text/html.
>>> * Is allowed in conforming XHTML.
>>> * Can be declared in a way which is conforming in both representations,
>>>   and has the same meaning in both.
>>> 
>>> If the only encoding that turns out to meet those requirements is
>>> UTF-8 then it necessarily follows that polyglot HTML documents must
>>> use UTF-8. Saying "Polyglot HTML documents use UTF-8" is therefore a
>>> description of a fact, and not itself a requirement; it places no
>>> further restrictions on those already made by the simple definition
>>> of what polyglot HTML is.
>>> 
>>> If, on the other hand, it turns out there is some other encoding
>>> which also meets the above criteria then that would be an example of
>>> a contradiction between polyglot HTML being a simple profile of the
>>> overlap between text/html and XHTML and it having its own normative
>>> requirements. 
>> 
>> Well, actually your logic would allow either UTF-8 or UTF-16 encodings
> 
> Not "my" logic, but the outcome of the definition of polyglot HTML being
> mark-up that can be processed with identical meanings as both text/html and
> XHTML.

For the record: As long as one relies on an external encoding 
declaration, it would be possible to use *any* legacy encoding. However, 
such a thing would be quite cumbersome to deal with, e.g. during authoring.
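
To illustrate what I mean by an external declaration, here is a minimal
Python sketch. It assumes the encoding arrives only as the charset
parameter of an HTTP Content-Type header; the helper names and the
sample bytes are made up for the example and come from no spec.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


def decode_with_external_label(body, header_value):
    """Decode bytes using only the externally declared charset."""
    charset = charset_from_content_type(header_value)
    if charset is None:
        raise ValueError("no external charset declared")
    return body.decode(charset)


# The same legacy-encoded bytes, served under both media types.
body = "caf\u00e9".encode("windows-1252")
as_html = decode_with_external_label(
    body, "text/html; charset=windows-1252")
as_xhtml = decode_with_external_label(
    body, "application/xhtml+xml; charset=windows-1252")
```

Both calls give the same text, but only because the header travels with
the bytes. Once the file is saved to disk or opened in an editor, the
label is gone, which is the cumbersome part.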

Sam recommended early on that Polyglot Markup only support UTF-8. He 
also used the expression "HTML with helmets on" about Polyglot Markup. 
But despite that, Polyglot Markup initially supported UTF-16 too. 
The reasoning was along the lines of what you argue above.

The justification I used in the bug I filed to make it only support 
UTF-8 was that the spec texts of HTML5 and XML only have UTF-8 as a 
common encoding, since HTML UAs are only required to support UTF-8 and 
ISO-8859-1, whereas XML UAs are only required to support UTF-16 and 
UTF-8. (Either may support more encodings, but these are the only 
required ones.) This justification was reviewed and accepted by the 
I18N working group. The I18N group was also opposed to any preference 
being given to the use of the BOM as the "most polyglot" way to declare 
the encoding - this because (I gather) they are in favor of the 
encoding being visibly declared in the markup (which seems like a good 
'with helmets on' principle, when one thinks about it). But from an 
HTML5 point of view, we then stumble upon the fact that it is forbidden 
to declare the UTF-16 encoding. 
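
The "common encoding" argument above is just a set intersection. A
small Python sketch, using only the required-encoding lists as I
summarized them in this message (not an independent reading of the
specs; the names are made up for the example):

```python
# Encodings each class of user agent is *required* to support,
# as stated in the paragraph above.
HTML_REQUIRED = {"utf-8", "iso-8859-1"}
XML_REQUIRED = {"utf-8", "utf-16"}


def common_required(a, b):
    """Return the encodings that both kinds of UA must support."""
    return sorted(a & b)


common = common_required(HTML_REQUIRED, XML_REQUIRED)
print(common)  # ['utf-8']
```

Anything outside that intersection depends on optional support in at
least one of the two kinds of UA.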

Also, now that the Encoding Standard is gaining attention, I will note 
that it says that new formats should use UTF-8 exclusively. One could 
also add that the Encoding Standard - and HTML5 - understands "UTF-16" 
to default to UTF-16LE if there is no BOM, whereas XML 1.0 and XML 
editors might be in (temporary) conflict with the Encoding Standard on 
that point.
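
To show why the BOM-less default matters, here is a rough Python
sketch. The function and its "default" parameter are hypothetical; the
only point is that the same bytes read differently depending on which
byte order a consumer assumes when no BOM is present.

```python
import codecs


def decode_utf16(data, default):
    """Decode UTF-16 bytes, honoring a BOM and otherwise using `default`.

    `default` must be "utf-16-le" or "utf-16-be".
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    return data.decode(default)


# Big-endian UTF-16 without a BOM: bytes 00 68 00 69.
no_bom = "hi".encode("utf-16-be")

assumed_le = decode_utf16(no_bom, "utf-16-le")  # little-endian default
assumed_be = decode_utf16(no_bom, "utf-16-be")  # big-endian default

# With a BOM, the default no longer matters.
with_bom = codecs.BOM_UTF16_BE + no_bom
same_either_way = (decode_utf16(with_bom, "utf-16-le")
                   == decode_utf16(with_bom, "utf-16-be"))
```

Here assumed_be is "hi", while assumed_le is two unrelated CJK
characters (U+6800 U+6900). UTF-8 has no byte order, so it has no such
ambiguity.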

So, as they say: All this taken together = UTF-8.

Maybe the current principle paragraphs could emphasize more strongly 
not the DOM side of things (that is strong enough, I think), but the 
fact that polyglot markup should also be a "spec subset" - a 
"textual"/syntactic subset - of what HTML5 and XHTML5 allow. And 
*maybe* the principles should also add a word about the *positive 
goals* of Polyglot Markup. I mean: To say that it is a mathematical 
subset of XHTML5 and HTML5 is, at best, just a boring fact. It would be 
good if it also presented some of the benefits intended by the 
specification of Polyglot Markup.

>> But in usual standards meaning profile is clearly defined subset and
>> such subset can define additional requirements like allowing only
>> UTF-8 in order to make interop easier.
> 
> The Polyglot spec doesn't claim to be a profile; the word "profile" does
> not appear anywhere in it.

Maybe it would be a good thing to include 'profile' somewhere, yes!
-- 
leif halvard silli
Received on Tuesday, 6 November 2012 12:30:57 UTC