[Bug 11909] The principles of Polyglot Markup - validity? well-formed? DOM-equality?

From: <bugzilla@jessica.w3.org>
Date: Sun, 30 Jan 2011 00:42:43 +0000
To: public-html-bugzilla@w3.org
Message-Id: <E1PjLNL-0008Re-5l@jessica.w3.org>
http://www.w3.org/Bugs/Public/show_bug.cgi?id=11909

--- Comment #6 from Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no> 2011-01-30 00:42:42 UTC ---
(In reply to comment #5)
> == 7.  Internationalization ==

> Polyglot Markup should use UTF-8, for such and such reasons:

(A)  I now believe that the exact set of permitted encodings should be a
conformance issue.
   That way we can address the point that several people, including Sam, seem
to want to make: that only UTF-8 should/must be used. (Plus, it solves _my_
problem: I think it would be stupid to say that polyglot markup, by
definition, needs to be UTF-8. I'm fine with limiting HTML5-compatible
documents to UTF-8, as long as the rule is founded in something solid.)

    Thus the principles section should only give general considerations - e.g.
it can say that if you set the encoding with a meta element, then you must
also set the encoding with an XML declaration, except when the encoding is
UTF-8 (or UTF-16).
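
To make that rule concrete, here is a rough Python sketch of how a checker
might test it. The function name, the regular expressions and the exempt set
are my own assumptions for illustration only; the patterns are deliberately
simplified, and the input is assumed to be text that has already been decoded.

```python
import re

# Simplified, hypothetical patterns; a real checker would use a proper parser.
XML_DECL_RE = re.compile(r'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')
META_RE = re.compile(r'<meta\s+charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.I)

# Encodings assumed to be exempt from needing an XML declaration.
EXEMPT = {"utf-8", "utf-16"}


def _label(match):
    """Return the captured encoding label in lowercase, or None if no match."""
    return match.group(1).lower() if match else None


def follows_declaration_rule(text: str) -> bool:
    """Check: a meta charset other than UTF-8/UTF-16 needs a matching XML declaration."""
    xml_enc = _label(XML_DECL_RE.match(text))   # only valid at the very start
    meta_enc = _label(META_RE.search(text))
    if meta_enc is None:
        return True                              # rule only applies when meta is used
    if meta_enc in EXEMPT:
        return xml_enc in (None, meta_enc)       # declaration optional, must not conflict
    return xml_enc == meta_enc                   # declaration required and must match
```

With this sketch, a document that has only <meta charset="ISO-8859-1"/> fails
the check, while the same document with a matching XML declaration passes.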

    The status of the HTML5 spec is that it permits <meta charset="UTF-8"/>
inside the XHTML syntax only when the value of @charset is UTF-8. And it also
forbids the use of the XML declaration. Thus, section 2, about
HTML5-conformance, should demand UTF-8.

(B) Regarding the general rules: we need to consider that HTML5/HTML parsers
have encoding detection algorithm(s). Polyglot Markup must be authored in such
a way that HTML5's encoding detection algorithm doesn't run (or at least does
not run further than the step where it finds a <meta charset> element at the
start of the doc). This rule needs to be in place in order to equalize both
the DOMs and the general experience. If the algorithm runs longer than that,
then, in HTML5, the page can be redrawn, the encoding can change during the
actual parsing, and so on.
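
As a rough illustration of "at the start of the doc", the following Python
sketch tests whether a <meta charset> element falls inside an initial byte
window. The function name is hypothetical, and the default window of 1024
bytes is an assumption on my part, so it is left as a parameter; the pattern
is simplified and does not skip comments or scripts the way a real parser
would.

```python
import re

# Simplified, hypothetical pattern for a <meta charset=...> start tag.
META_CHARSET_RE = re.compile(rb'<meta\s+charset\s*=\s*["\']?[A-Za-z0-9._-]+', re.I)


def meta_charset_in_window(doc: bytes, window: int = 1024) -> bool:
    """True if a <meta charset> declaration lies fully within the first `window` bytes."""
    return META_CHARSET_RE.search(doc[:window]) is not None
```

A document whose head begins with a long comment or script before the meta
element would fail this check, which is the situation where the detection
algorithm may have to keep running.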

I would also like to add a 10th point to the principles:

== 10. Authoring equality ==

It should be possible to author polyglots using both HTML tools and XML tools.
And authoring is, in this case, understood as working on a single file - not
in a CMS but in a file system.

The practical consequence of this is that if you use encodings other than
UTF-8/UTF-16, and also if you don't use the BOM, then there *must* be an
encoding declaration inside the document. (This, in turn, leads us to say
that, for HTML5-conforming documents, only UTF-8 (and perhaps UTF-16 -
must think) is permitted.)
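
A file-based tool could test that consequence roughly as follows. This is
only a sketch: the function name and the simplified pattern are my own
assumptions, and it only recognises declarations written in an
ASCII-compatible encoding.

```python
import codecs
import re

# Byte order marks for UTF-8 and both UTF-16 byte orders.
BOMS = (codecs.BOM_UTF8, codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Simplified, hypothetical pattern: an XML declaration with an encoding
# pseudo-attribute, or a <meta charset=...> element.
DECL_RE = re.compile(rb'<\?xml[^>]*encoding\s*=|<meta\s+charset\s*=', re.I)


def self_describes_encoding(doc: bytes) -> bool:
    """True if the file starts with a BOM or carries an in-document encoding declaration."""
    return doc.startswith(BOMS) or DECL_RE.search(doc) is not None
```

A file for which this returns False relies on information outside the file
(HTTP headers, tool defaults), which is exactly what single-file authoring
cannot depend on.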

Received on Sunday, 30 January 2011 00:42:44 GMT
