Re: i18n comments on Polyglot Markup from Leif Halvard Silli on 2010-07-15 (public-html@w3.org from July 2010)

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Thu, 15 Jul 2010 23:57:49 +0400
To: Edward O'Connor <hober0@gmail.com>
Cc: Sam Ruby <rubys@intertwingly.net>, Anne van Kesteren <annevk@opera.com>, Richard Ishida <ishida@w3.org>, public-html@w3.org
Message-ID: <20100715235749359310.0607d23a@xn--mlform-iua.no>

Edward O'Connor, Thu, 15 Jul 2010 12:04:45 -0700:
> Sam Ruby:
>> We could also go a different way entirely, and say that polyglot documents
>> are a subset of both HTML5 and XHTML5, and the subset that we select is
>> only utf-8.
> 
> I think this is the way to go. I suspect 90+% of the usefulness of the
> polyglot spec is its usefulness as a best practice style guide for
> people producing HTML content, and <10% as an additional means for
> allowing people to use XML toolchains. Always using UTF-8 is such a
> best practice.

I agree that it is possible to conclude, from two angles, that Polyglot
Markup should be based on UTF-8. We can say that this is a spec of its
own and that we therefore decide its rules. Or we can treat Polyglot
Markup as a best practice document and rule that UTF-8 is the best
practice. (We then also make polyglot syntax as such a kind of best
practice, too, I think.)

My attitude when filing bugs etc. against Polyglot Markup has been more
or less based on what Henri described in a bug comment - from memory:
Polyglot Markup should be a common denominator of the XML spec and the
HTML spec. And I find no reason, in either the XML or the HTML5 spec,
for forbidding UTF-16 in Polyglot Markup.

But even if we say that "we decide the rules", I still think that
Polyglot Markup needs some additional principle in order to justify
UTF-8 as the sole encoding. From memory, HTML5's recommendation of
UTF-8 is related to URLs and form handling. And those are also, I
guess, reasons for preferring UTF-8 in an XHTML document - the
consequences of an invalid character in XHTML are only more draconian
than they are in HTML ... Otherwise the issue is the same.
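
To illustrate what I mean by "more draconian", here is a minimal sketch
in Python. It only shows the difference in failure mode, not anything
about forms or URLs. The document bytes and the function names are made
up for this example, and Python's standard library parsers merely stand
in for what a browser's XML and HTML parsers would do:

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical polyglot-style document containing one byte (0xFF)
# that is never valid in UTF-8.
DOC = (b'<html xmlns="http://www.w3.org/1999/xhtml">'
       b'<head><title>t</title></head>'
       b'<body><p>caf\xff</p></body></html>')


def parse_as_xml(data: bytes) -> bool:
    """Return True if the XML parser accepts the bytes, False on a fatal error."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        # XML treats a malformed character as fatal: nothing is rendered.
        return False


class TextCollector(HTMLParser):
    """Collects the text content seen while parsing as HTML."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def parse_as_html(data: bytes) -> str:
    """Decode leniently, as HTML consumers tend to do, and return the text."""
    # Assumption: undecodable bytes become U+FFFD instead of stopping the parse.
    text = data.decode("utf-8", errors="replace")
    collector = TextCollector()
    collector.feed(text)
    collector.close()
    return "".join(collector.chunks)
```

With these definitions, parse_as_xml(DOC) reports a failure, whereas
parse_as_html(DOC) still returns the text with a replacement character
where the bad byte was.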

I could think of the following justification, then: Due to the often
more draconian consequences of malformed characters in XHTML, it is
recommended/required to use UTF-8, as UTF-8 reduces the chance of
malformed characters in forms etc. In other words: in order to be more
compatible with HTML, a polyglot served/parsed as XHTML needs to be
served as UTF-8, to reduce the possibility that a polyglot parsed as
XHTML becomes more inaccessible (due to malformed form input) than the
same polyglot would be when parsed as HTML.

Does this make sense to anyone?
-- 
leif h silli

Received on Thursday, 15 July 2010 19:59:18 UTC