Re: i18n Polyglot Markup/Encodings

On Jul 23, 2010, at 01:32, Leif Halvard Silli wrote:

> Hm. According to section F.1 "Detection Without External Encoding 
> Information" of XML 1.0, fifth edition:
> 
> 	]] […] each XML entity not accompanied by external encoding 
> information and not in UTF-8 or UTF-16 encoding must begin with an XML 
> encoding declaration […] [[
> 
> And in the same spec, section 4.3.3 "Character Encoding in Entities":
> 
> 	]] In the absence of external character encoding information (such as 
> MIME headers), parsed entities which are stored in an encoding other 
> than UTF-8 or UTF-16 must begin with a text declaration (see 4.3.1 The 
> Text Declaration) containing an encoding declaration: [[
> 
> Thus, inferring from the above quotations, it seems like any encoding 
> is possible, provided one avoids the XML (encoding) declaration and 
> instead relies on external encoding information, typically HTTP headers.
> 
> Do you see any fallacy in this conclusion?

The conclusion is correct, but it requires defining "polyglot" broadly enough to include the charset parameter of the Content-Type header as part of the polyglot data that doesn't vary.
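As a minimal sketch of what "external encoding information" looks like in practice (using the Python standard library purely for illustration, not any particular W3C tooling), the charset rides on the HTTP Content-Type header and must stay fixed for both consumers:

```python
# Sketch: external encoding information as carried by an HTTP
# Content-Type header (header value is an illustrative assumption).
from email.message import Message

msg = Message()
msg['Content-Type'] = 'application/xhtml+xml; charset=windows-1252'

# For a polyglot document, this charset parameter is part of the
# invariant data: the XML processor and the HTML5 UA must both see it.
print(msg.get_content_charset())  # windows-1252
```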

There's one catch, though: the pure XML processing model doesn't treat the document's original encoding as part of the meaningful data the document encodes. Thus, if the document includes non-ASCII characters in URL query strings, the URL resolves differently in pure XML tooling than in HTML5-compliant UAs. However, if only valid documents are considered, this isn't a problem: non-ASCII in query strings is already non-conforming when the document's encoding isn't UTF-8 or UTF-16.
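The query-string divergence above can be sketched with Python's urllib (an illustration of the effect, not the specs' actual algorithms): HTML5-compliant UAs percent-encode non-ASCII query characters using the document's encoding, while pure XML tooling effectively always uses UTF-8.

```python
# Sketch, assuming a document encoded as windows-1252 containing
# a link whose query string holds the character 'é'.
from urllib.parse import quote

# What an HTML5-compliant UA sends (document encoding = windows-1252):
print(quote('é', encoding='windows-1252'))  # %E9

# What pure XML tooling sends (always UTF-8):
print(quote('é', encoding='utf-8'))         # %C3%A9
```

The two processors thus request different URLs from the same markup, which is why the polyglot rules can only tolerate non-ASCII query strings when the document encoding is UTF-8 or UTF-16 in the first place.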

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Monday, 26 July 2010 08:30:41 UTC