- From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
- Date: Tue, 6 Nov 2012 13:30:19 +0100
- To: Smylers <Smylers@stripey.com>
- Cc: public-html@w3.org
Smylers, Tue, 6 Nov 2012 10:52:56 +0000:
> Jirka Kosek writes:
>> On 5.11.2012 15:04, Smylers wrote:

Regarding how "UTF-8" fits with the thoughts behind Polyglot Markup:

>>>> For example as both in HTML5 and in XML you have some variety in
>>>> choosing encoding, Polyglot must *normatively* define that only
>>>> allowed encoding is UTF-8.
>>>
>>> It can do that by reference; it doesn't need to so it explicitly.
>>> Clearly by the definition polyglot HTML (being the overlap of text/html
>>> and XHTML) a conforming polyglot document needs to use an encoding
>>> which:
>>>
>>> * Is allowed in conforming text/html.
>>> * Is allowed in conforming XHTML.
>>> * Can be declared in a way which is conforming in both representations,
>>>   and has the same meaning in both.
>>>
>>> If the only encoding that turns out to meets those requirements is
>>> UTF-8 then it necessarily follows that polyglot HTML documents must
>>> use UTF-8. Saying "Polyglot HTML documents use UTF-8" is therefore a
>>> description of a fact, and not itself a requirement; it places no
>>> further restrictions on those already made by the simple definition
>>> of what polyglot HTML is.
>>>
>>> If, on the other hand, it turns out there is some other encoding
>>> which also meets the above criteria then that would be an example of
>>> a contradiction between polyglot HTML being a simple profile of the
>>> overlap between text/html and XHTML and it having its own normative
>>> requirements.
>>
>> Well, actually your logic would allow either UTF-8 or UTF-16 encodings
>
> Not "my" logic, but the outcome of the definition of polyglot HTML being
> mark-up that can be processed with identical meanings as both text/html and
> XHTML.

For the record: As long as one relies on an external encoding declaration, it would be possible to use *any* legacy encoding. However, such a thing would be quite cumbersome to deal with, e.g. during authoring.

Sam recommended early on that Polyglot Markup only support UTF-8.
He also used the expression "HTML with helmets on" about Polyglot Markup. Despite that, Polyglot Markup initially supported UTF-16 too, for reasons along the lines that you argue above.

The justification I used in the bug I filed to make it support only UTF-8 was that the spec texts of HTML5 and XML have only UTF-8 as a common encoding: HTML UAs are only required to support UTF-8 and ISO-8859-1, whereas XML UAs are only required to support UTF-16 and UTF-8. (Either may support more encodings, but these are the only required ones.) This justification was reviewed and accepted by the I18N working group.

The I18N group was also opposed to any preference being given to the BOM as the "most polyglot" way to declare the encoding, because (I gather) they are in favor of the encoding being visibly declared in the markup (which seems like a good 'with helmets on' principle, when one thinks about it). But from an HTML5 point of view, we then stumble upon the fact that it is forbidden to declare the UTF-16 encoding.

Also, now that the Encoding Standard is gaining attention, I will note that it says that new formats should use UTF-8 exclusively. One could also add that the Encoding Standard - and HTML5 - understand "UTF-16" to default to UTF-16LE if there is no BOM, whereas XML 1.0 and XML editors might be in (temporary) conflict with the Encoding Standard on that point.

So, as they say: all this taken together = UTF-8.

Maybe the current principle paragraphs could emphasize more strongly not the DOM side of things (that is strong enough, I think), but the fact that polyglot markup should also be a "spec subset" - a "textual"/syntactic subset - of what HTML5 and XHTML5 allow. And *maybe* the principles should also add a word about the *positive goals* of Polyglot Markup. I mean: to say that it is a mathematical subset of XHTML5 and HTML5 is, at best, just a boring fact.
It would be good if it also presented some of the benefits intended by the specification of Polyglot Markup.

>> But in usual standards meaning profile is clearly defined subset and
>> such subset can define additional requirements like allowing only
>> UTF-8 in order to make interop easier.
>
> The Polyglot spec doesn't claim to be a profile; the word "profile" does
> not appear anywhere in it.

Maybe it would be a good thing to include 'profile' somewhere, yes!
-- 
leif halvard silli
Received on Tuesday, 6 November 2012 12:30:57 UTC