Re: i18n Polyglot Markup/Encodings

Henri Sivonen, Thu, 29 Jul 2010 07:19:37 -0700 (PDT):
>> Next question: if one can specify any encoding via HTTP, why forbid
>> any encoding inside <meta charset='*'/>?
> 
> If the meta prescan finds something, the real encoding has to be a 
> rough ASCII superset.
> 
> See https://bugzilla.mozilla.org/show_bug.cgi?id=582788

You misunderstood. Whether one should be permitted to specify UTF-16 
via meta@charset is not the issue under discussion. The dilemma is 
this: what line of reasoning would lead us to conclude that <meta 
charset="windows-1251"/>, but not <meta charset="utf-8"/>, should be 
forbidden in polyglot markup? (I ask because it is my impression that 
you think polyglot markup should permit any encoding, with the 
limitation that only UTF-8 can be specified as the encoding via 
meta@charset.)

Suggested defense for such a view: meta@charset is not really 
permitted in XHTML; according to HTML5 [1], it is "only allowed [in 
XHTML] in order to facilitate migration to and from XHTML".

Comment: We can all see that <meta charset="UTF-8"/> can be useful for 
such a migration purpose. But how can <meta charset="windows-1251"/> 
"facilitate migration to and from XHTML"? The answer is: it can't. No 
more than the presence of <?xml version="1.0" 
encoding="WINDOWS-1251" ?> can. (But if both are present, then 
together they can facilitate migration.)

Perhaps HTML5 as well should only permit "UTF-8" as the value of <meta 
charset="*"/>, when present in XHTML? That could solve this dilemma 
when it comes to Polyglot Markup as well! 

	Please file a bug, if you think so. 

If <meta charset="windows-1251"/> can't facilitate migration, then one 
should think it can even less be polyglot ... However, by the letter, 
<meta charset="windows-1251"/> is permitted in both XHTML5 and 
HTML5. Thus it is polyglot. OK, <meta charset="windows-1251"/> creates 
some problems, such as possibilities for misunderstanding. But 
it is still polyglot, because it is still permitted.

>> And then: Why allow any encoding inside <meta charset='*'/> but not
>> allow the XML (encoding) declaration?
> 
> Because there's no parser support for what looks like an XML 
> declaration in text/html.

In polyglot markup, there would be no more need for text/html support 
for the XML encoding declaration than there would be for XML support 
for the meta@charset element. What I suggest is that polyglot markup 
permit both <meta charset="ISO-8859-1"/> and <?xml version="1.0" 
encoding="ISO-8859-1" ?> to be used for specifying the encoding, as 
long as *both* of them are present in the same document.
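
To illustrate the rule I am suggesting, here is a minimal sketch of a 
checker that accepts a document only when both declarations are present 
and name the same encoding. The function names and the regular 
expressions are my own illustrative assumptions; the matching is 
deliberately simplified and is not the HTML5 meta prescan algorithm 
(it ignores http-equiv, comments, and so on).

```python
import re

# Simplified, hypothetical patterns; not the HTML5 prescan algorithm.
META_RE = re.compile(
    rb'<meta\s+charset\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE
)
XML_DECL_RE = re.compile(
    rb'^<\?xml\s[^>]*?encoding\s*=\s*["\']([^"\']+)["\']'
)


def declared_encodings(doc: bytes):
    """Return (xml_decl_encoding, meta_charset), lowercased; None if absent."""
    xml_match = XML_DECL_RE.search(doc)
    meta_match = META_RE.search(doc)
    xml_enc = (
        xml_match.group(1).decode("ascii", "replace").lower()
        if xml_match else None
    )
    meta_enc = (
        meta_match.group(1).decode("ascii", "replace").lower()
        if meta_match else None
    )
    return xml_enc, meta_enc


def declarations_agree(doc: bytes) -> bool:
    """True only if both declarations are present and name the same encoding."""
    xml_enc, meta_enc = declared_encodings(doc)
    return xml_enc is not None and xml_enc == meta_enc


both = (
    b'<?xml version="1.0" encoding="ISO-8859-1" ?>\n'
    b'<html xmlns="http://www.w3.org/1999/xhtml"><head>'
    b'<meta charset="ISO-8859-1"/><title>t</title></head>'
    b'<body/></html>'
)
meta_only = b'<html><head><meta charset="windows-1251"/></head></html>'

print(declarations_agree(both))       # both present and equal
print(declarations_agree(meta_only))  # XML declaration missing
```

The comparison is case-insensitive because encoding labels are matched 
that way on both the HTML and the XML side.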

Just as one can say that meta@charset is permitted in XHTML "to 
facilitate migration to and from XHTML", one can equally say that the 
XML encoding declaration should be permitted in HTML to facilitate 
migration to and from HTML. In fact, it does not seem true, except 
when the document is UTF-8 encoded, that <meta charset="*"/> 
facilitates migration to and from XHTML. (And even then, as long as 
the document uses the UTF-8 BOM, which HTML UAs are required to 
support, <meta charset="*"/> doesn't really facilitate anything.)

The true story about facilitating migration between HTML and XHTML is 
that, yes, meta@charset can make this easier for UTF-8. But when it 
comes to the non-Unicode encodings, meta@charset facilitates nothing 
*unless* the XML encoding declaration is also permitted.
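
The XML side of this point can be demonstrated with a small sketch. It 
assumes Python's standard xml.etree.ElementTree (an expat-based parser) 
as a stand-in for an XML processor; the sample text and the helper name 
parse_text are illustrative choices of mine. A windows-1251 document 
without an XML encoding declaration is treated as UTF-8 and rejected, 
while the same bytes with the declaration parse as intended.

```python
import xml.etree.ElementTree as ET

# Sample Cyrillic text ("Privet"), written with escapes to keep this
# source file pure ASCII.
text = "\u041f\u0440\u0438\u0432\u0435\u0442"
raw = ("<p>" + text + "</p>").encode("windows-1251")


def parse_text(doc: bytes):
    """Return the root element's text, or None if the parser rejects doc."""
    try:
        return ET.fromstring(doc).text
    except ET.ParseError:
        return None


# No declaration: the parser assumes UTF-8, and the bytes are not valid UTF-8.
without_decl = parse_text(raw)

# With the declaration, the parser is told how to decode the same bytes.
with_decl = parse_text(
    b'<?xml version="1.0" encoding="windows-1251"?>' + raw
)

print(without_decl is None)
print(with_decl == text)
```

A <meta charset="windows-1251"/> inside the document would not change 
the first result, since an XML parser gives that element no special 
meaning.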

> And support isn't going to be added for mere polyglot purity.

There would be no need to add support: in Polyglot Markup, both 
meta@charset and the XML encoding declaration would eventually be 
present.

[1] Section "4.2.5 The meta element" of HTML5.
-- 
leif halvard silli

Received on Monday, 2 August 2010 00:01:24 UTC