[Bug 12242] New: Make UTF-16 an invalid encoding in Polyglot Markup from bugzilla@jessica.w3.org on 2011-03-05 (public-html@w3.org from March 2011)

From: <bugzilla@jessica.w3.org>
Date: Sat, 05 Mar 2011 02:39:04 +0000
To: public-html@w3.org
Message-ID: <bug-12242-2495@http.www.w3.org/Bugs/Public/>

http://www.w3.org/Bugs/Public/show_bug.cgi?id=12242

Summary: Make UTF-16 an invalid encoding in Polyglot Markup
Product: HTML WG
Version: unspecified
Platform: PC
URL: http://dev.w3.org/html5/html-xhtml-author-guide/html-xhtml-authoring-guide.html#character-encoding
OS/Version: All
Status: NEW
Severity: normal
Priority: P2
Component: HTML/XHTML Compatibility Authoring Guide (ed: Eliot Graff)
AssignedTo: eliotgra@microsoft.com
ReportedBy: xn--mlform-iua@xn--mlform-iua.no
QAContact: public-html-bugzilla@w3.org
CC: mike@w3.org, public-html-wg-issue-tracking@w3.org,
public-html@w3.org, eliotgra@microsoft.com

* According to HTML5, HTML parsers must at a minimum support UTF-8
and Windows-1252.
http://dev.w3.org/html5/spec/parsing.html#character-encodings-0
* According to XML, by contrast, XML parsers must at a minimum support
UTF-8 and UTF-16.
* Polyglot Markup "prefers" UTF-8 (presumably based on HTML5's UTF-8
preference), but otherwise follows the XML approach and permits both
UTF-8 and UTF-16.

AS A RESULT, it becomes possible to author "polyglot markup" that works fine in
XML parsers but is not required to work in every HTML parser.

We should not label markup that isn't required to work in an HTML parser as
"polyglot markup". Hence UTF-16 should not be a recommended encoding for
Polyglot Markup.
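
The argument above amounts to a set intersection, which can be sketched as
follows. This is only an illustration of the reasoning: the names
HTML_REQUIRED, XML_REQUIRED and guaranteed_in_both are made up for this
example, and the two sets simply restate the minimum requirements listed in
the bullets above.

```python
# Minimum encodings each kind of parser must support, as stated above.
# (Illustrative constants, not taken from any library.)
HTML_REQUIRED = {"UTF-8", "Windows-1252"}
XML_REQUIRED = {"UTF-8", "UTF-16"}


def guaranteed_in_both(html_required, xml_required):
    """Return the encodings that every HTML parser AND every XML parser must support."""
    return html_required & xml_required


# prints ['UTF-8']
print(sorted(guaranteed_in_both(HTML_REQUIRED, XML_REQUIRED)))
```

Under these assumptions, UTF-8 is the only encoding that a document can use
and still be guaranteed to work in both kinds of parser.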

Discussion:

* It was suggested early on, e.g. by Sam Ruby, that UTF-8 should be the
only recommended encoding for polyglot markup. This could be a very useful
suggestion. For instance, it could help to "force" many HTML editing programs
to default to UTF-8. It also matches HTML5, which says that new documents
SHOULD default to UTF-8.

* However, the problem is to justify the *exclusion* of UTF-16 by inference
from the specs, because the use of UTF-16 does not seem to conflict with the
principles behind Polyglot Markup, as laid out in its introduction:

http://dev.w3.org/html5/html-xhtml-author-guide/html-xhtml-authoring-guide.html#introduction

* Permission to use UTF-16 in polyglot markup is logical, for instance because

- UTF-16 can be reliably detected via the BOM, in both XML and HTML5
- though HTML5 says that, quote: "Using non-UTF-8 encodings can
have unexpected results on form submission and URL encodings,
which use the document's character encoding by default", the use of
non-UTF-8 encodings probably creates form problems in XML-on-the-web
too. Thus XML and HTML are probably in the same boat here, and
hence it does not seem logical to hold it against UTF-16 that some
form submission problems could occur.

* That said, the problems with non-UTF-8 encodings *should* carry *some*
weight: e.g. those form submission problems could cause greater
problems in XML, and it is a small irritation that it is not
permitted/possible to use an explicit character encoding declaration
in UTF-16 encoded documents.
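
As a rough illustration of the BOM detection mentioned in the list above,
here is a minimal sketch. The function name sniff_bom is hypothetical, and
this is not the full encoding detection algorithm of either HTML5 or XML; it
only checks the leading bytes against the BOM constants from Python's
standard codecs module.

```python
import codecs


def sniff_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none.

    Simplification: a UTF-32LE BOM also starts with FF FE and would be
    reported here as UTF-16LE; UTF-32 is ignored in this sketch.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "UTF-16LE"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "UTF-16BE"
    return None


# A UTF-16 (little-endian) document with a BOM, built explicitly so the
# result does not depend on the platform's native byte order.
doc = codecs.BOM_UTF16_LE + "<!DOCTYPE html>".encode("utf-16-le")
print(sniff_bom(doc))
```

The check is cheap and unambiguous for UTF-8 and UTF-16, which is the sense
in which the BOM makes UTF-16 "reliably detectable". Detection alone does not
help a parser that cannot decode UTF-16 in the first place.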

However, the fact that HTML parsers aren't required to support UTF-16 is a
more fundamental nail in the coffin.

Can it have any real-world effect? Not so much when it comes to "big" browsers:
they support multiple encodings. But for "simplistic" parsers of different
kinds, it could probably have an effect.
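
To make the "simplistic parser" case concrete, here is a hypothetical sketch
of a decoder that supports only the two encodings HTML5 requires. The
function minimal_decode is invented for this example and does not describe
any real parser; the fallback order (try UTF-8, then Windows-1252) is likewise
an assumption.

```python
def minimal_decode(data: bytes) -> str:
    """Decode with only UTF-8 and Windows-1252, the HTML5 minimum set.

    Hypothetical behaviour: try strict UTF-8 first, then fall back to
    Windows-1252 (Python codec name "cp1252").
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("cp1252")


# "<p>" encoded as UTF-16LE with a BOM: FF FE 3C 00 70 00 3E 00
utf16_doc = b"\xff\xfe" + "<p>".encode("utf-16-le")

# The BOM bytes are invalid UTF-8, so the fallback applies; the result is
# mojibake with embedded NUL characters rather than the markup "<p>".
print(repr(minimal_decode(utf16_doc)))
```

Such a decoder does not fail loudly on UTF-16 input. It silently produces
text that no longer contains the intended markup, which is the kind of effect
a minimal parser could plausibly show.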


Received on Saturday, 5 March 2011 02:39:06 UTC