RE: i18n comments on Polyglot Markup

> From: Sam Ruby [mailto:rubys@intertwingly.net]
> Sent: 15 July 2010 19:43
...
> > UTF-16 encoded XML documents, on the other hand, must start with a
> > BOM; see http://www.w3.org/TR/REC-xml/#charencoding. When the doc is
> > treated as XML, however, the meta element is ignored.
> >
> > Do you agree?
> 
> I disagree that UTF-16 encoded XML document must start with a BOM.  See:
> 
> http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info
> 
> My personal experience (which now may be dated) is that there are a
> number of XML parsers that choke in the presence of a BOM.  But even if
> we ignore such, there still are quite a few ways to go given this set of
> information.

Although that section includes methods of detecting utf-16 as well as other
encodings, I'm basing my conclusion on the following text:

"Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin
with the Byte Order Mark described by Annex H of [ISO/IEC 10646:2000],
section 16.8 of [Unicode] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF).
This is an encoding signature, not part of either the markup or the
character data of the XML document. XML processors MUST be able to use this
character to differentiate between UTF-8 and UTF-16 encoded documents."
(http://www.w3.org/TR/REC-xml/#charencoding )

That seems pretty clear to me, but I will check with some XML gurus.
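For what it's worth, the rule quoted above is simple to put into code. Here is a minimal Python sketch of BOM-based detection (the function name is my own; the byte sequences are the standard BOM serialisations):

```python
# A quick sketch of the rule quoted above: use the leading bytes (the BOM)
# to tell UTF-8 and the two UTF-16 serialisations apart.
def sniff_bom(first_bytes):
    """Return the encoding implied by a leading BOM, or None if there isn't one."""
    if first_bytes.startswith(b"\xef\xbb\xbf"):
        return "utf-8"          # UTF-8 BOM (optional per the spec: MAY)
    if first_bytes.startswith(b"\xff\xfe"):
        return "utf-16-le"      # ZWNBSP serialised little-endian
    if first_bytes.startswith(b"\xfe\xff"):
        return "utf-16-be"      # ZWNBSP serialised big-endian
    return None                 # no BOM; a UTF-16 entity would be non-conformant
```

(A real detector would also have to consider UTF-32 BOMs and the BOM-less byte patterns described in the sec-guessing appendix, but this covers the MUST/MAY sentence quoted above.)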

> 
> It turns out that <meta charset="utf-16"/> will always be ignored but
> the content processed correctly if the content is correctly encoded as
> utf-16.  I gather that Richard would prefer that such elements not be
> treated as conformance errors, whereas Ian would prefer that such
> elements be treated as conformance errors.

I have two reasons for my preference.

[1] Some people will probably add meta elements to utf-16 encoded documents
anyway, and I can see no harm in it, so there's no real reason to penalise
them for it.

[2] i18n folks have long advised that you should always include a visible
indication of the encoding in a document, HTML or XML, even if you don't
strictly need to, because it can be very useful for developers, testers, or
translation production managers who want to visually check the encoding of a
document.
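The advice in [2] also lends itself to a quick mechanical check. Below is a minimal Python sketch of the kind of thing a tester might run; the function name and the 1024-byte window are my own assumptions, not anything taken from the specs:

```python
import re

def declared_encoding(doc_bytes):
    # Hypothetical helper, just to illustrate the advice above: scan the
    # start of a document for a visible encoding declaration (an XML
    # declaration's encoding pseudo-attribute, or an HTML meta charset).
    head = doc_bytes[:1024].decode("ascii", errors="ignore")
    head = head.replace("\x00", "")  # tolerate utf-16 interleaved NULs
    m = re.search(r'encoding=["\']([A-Za-z0-9._-]+)["\']', head)
    if m:
        return m.group(1)
    m = re.search(r'charset=["\']?([A-Za-z0-9._-]+)', head, re.IGNORECASE)
    if m:
        return m.group(1)
    return None
```

The point of the visible declaration is exactly that a person (or a trivial script like this) can check it without having to infer the encoding from raw bytes.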


> 
> We could also go a different way entirely, and say that polyglot
> documents are a subset of both HTML5 and XHTML5, and the subset that we
> select is only utf-8.  I mention this as this is my personal
> recommendation on the matter, but I can live with either of the other two
> alternatives mentioned above.

While it would be wonderful to live in a world where only utf-8 encodings
are allowed, I'm not sure we can do that.  I think we need to acknowledge
that these are XML documents, and although we certainly constrain the
vocabulary, I'm leery of taking away people's right to use other encodings
if they insist.

(On the other hand, it would certainly solve a lot of problems, as long as
people really do understand what utf-8 is and are able to save, store and
process content in it as easily as they handle their current XHTML.  Btw, do
we have any figures on what percentage of the XHTML out there currently uses
utf-8?)


RI

Received on Thursday, 15 July 2010 20:01:52 UTC