Re: i18n Polyglot Markup/Encodings

Henri Sivonen, Mon, 26 Jul 2010 11:29:59 +0300:
> On Jul 23, 2010, at 01:32, Leif Halvard Silli wrote:
> 
>> Hm. According to ... XML 1.0, fifth edition:
   snip
>> Thus, inferring from the above quotations, it seems like any encoding 
>> is possible, provided one avoids the XML (encoding) declaration and 
>> instead relies on external encoding information, typically HTTP headers.
>> 
>> Do you see any fallacy in this conclusion?
> 
> The conclusion is correct, but it requires defining "polyglot" 
> broadly enough to include the charset parameter of the content type 
> as part of the polyglot data that doesn't vary.

Both XML and HTML5 "include" HTTP. Thus HTTP is polyglot.
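
To make "external encoding information" concrete, here is a minimal 
sketch of a consumer that takes the encoding only from the charset 
parameter of the Content-Type header and ignores anything declared 
inside the document. The function names and the UTF-8 fallback are my 
own assumptions for illustration; they are not taken from either spec.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the value and returns None
    # when the header carries no charset parameter.
    return msg.get_content_charset()


def decode_body(body, header_value, fallback="utf-8"):
    """Decode bytes using only the transport-level charset.

    The fallback is an assumption of this sketch, not a rule from
    XML or HTML5.
    """
    charset = charset_from_content_type(header_value) or fallback
    return body.decode(charset)


if __name__ == "__main__":
    header = "application/xhtml+xml; charset=ISO-8859-1"
    print(charset_from_content_type(header))
    print(decode_body("caf\u00e9".encode("iso-8859-1"), header))
```

The same bytes and the same header would be handed to an XML parser and 
to an HTML parser alike, which is the sense in which the header is the 
part that does not vary.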

Next question: if one can specify any encoding via HTTP, why forbid any 
encoding inside <meta charset='*'/>? 

And then: Why allow any encoding inside <meta charset='*'/> but not 
allow the XML (encoding) declaration?

> There's one catch though: The pure XML processing model doesn't treat 
> the original encoding of the document as part of the meaningful data 
> encoded by the document. Thus, if the document includes non-ASCII 
> characters in URL query strings, the URL resolves differently in pure 
> XML tooling and in HTML5-compliant UAs. However, if only valid 
> documents are considered, this isn't a problem, because non-ASCII in 
> query strings is already declared non-conforming if the encoding of 
> the document isn't UTF-8 or UTF-16.

Thanks for pointing this out. Thus, in a way, non-UTF-8 and non-UTF-16 
documents become a subset of Polyglot Markup - with its own rules.
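
The divergence described above can be illustrated with a small sketch. 
It assumes, as a simplification, that an HTML5 UA percent-encodes the 
query string using the document's encoding, while pure XML tooling has 
already discarded that encoding and uses UTF-8. The function name and 
the sample URL query are hypothetical.

```python
from urllib.parse import quote


def encode_query(query, encoding):
    """Percent-encode a query string after converting it to `encoding`.

    '=' and '&' are left alone so the key/value structure survives.
    """
    return quote(query, safe="=&", encoding=encoding)


if __name__ == "__main__":
    query = "q=caf\u00e9"  # non-ASCII character in the query string

    # HTML5-style: bytes depend on the document encoding.
    print(encode_query(query, "windows-1252"))

    # Pure XML tooling: the original encoding is not part of the
    # data model, so UTF-8 is used regardless.
    print(encode_query(query, "utf-8"))
```

The two results differ, so the same document yields two different URLs 
depending on the consumer. With a pure-ASCII query string both calls 
give the same output, which is why the conformance rule sidesteps the 
problem.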

Btw, what about non-ASCII chars in query strings in UTF-32 encoded 
documents? Shouldn't that work? (UTF-32 is recommended against, but 
still permitted, in HTML5.)
-- 
leif halvard silli

Received on Thursday, 29 July 2010 12:51:49 UTC