- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Wed, 3 Jun 2009 10:24:08 +0300
- To: Ira McDonald <blueroofmusic@gmail.com>
- Cc: "public-html@w3.org WG" <public-html@w3.org>, www-international@w3.org, Harley Rosnow <Harley.Rosnow@microsoft.com>, Travis Leithead <Travis.Leithead@microsoft.com>
On Jun 2, 2009, at 19:17, Leif Halvard Silli wrote:

> Like several others, your reply does not incorporate the authoring
> tools perspective that Larry contributed to this thread[1].

The thread started about the default encoding on the *consuming*
side. *Of course* authoring tools should use UTF-8 *and declare it*
for any new documents. HTML5 already says: "Authors are encouraged to
use UTF-8."
http://www.whatwg.org/specs/web-apps/current-work/#charset

On Jun 2, 2009, at 18:54, Ira McDonald wrote:

> I suggest that claiming conformance to HTML5 means that you MUST
> always supply an explicit charset declaration on the Content-Type
> line - no confusion at all for older browsers and content management
> systems.

I've previously argued that conformance should require an explicit
encoding declaration (a BOM counting as an explicit declaration).
However, my suggestion was rejected because there's no real interop
issue with HTML files that contain only ASCII bytes, and it would be
inconvenient to have to declare the encoding in small ASCII-only test
cases. My counter-argument is that it's useful for a validator to
whine even in the ASCII-only case, because the validator user may be
testing a CMS template that is ASCII-only at the time of testing but
gets filled with arbitrary content at deployment time.

Anyway, as it stands, HTML5 *requires* the encoding to be declared in
order for the document to be valid if the byte stream has *non-ASCII*
bytes. Validator.nu issues an error in that case and a warning in the
ASCII-only case. Quoting the spec for reference:

> If an HTML document does not start with a BOM, and if its encoding
> is not explicitly given by Content-Type metadata, then the character
> encoding used must be an ASCII-compatible character encoding, and,
> in addition, if that encoding isn't US-ASCII itself, then the
> encoding must be specified using a meta element with a charset
> attribute or a meta element in the Encoding declaration state.

http://www.whatwg.org/specs/web-apps/current-work/#charset
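For concreteness, here is a minimal sketch of a document that
satisfies the quoted requirement with an in-document declaration (the
title and body text are placeholders, not from the thread); a
Content-Type header of "text/html; charset=utf-8" or a leading BOM
would satisfy it equally:

  <!DOCTYPE html>
  <html>
    <head>
      <!-- In-document declaration via the meta charset attribute,
           as named in the quoted spec text. -->
      <meta charset="utf-8">
      <title>Placeholder title</title>
    </head>
    <body>
      <!-- Non-ASCII content, which is what triggers the
           declaration requirement. -->
      <p>Hyvää päivää!</p>
    </body>
  </html>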
On Jun 2, 2009, at 20:27, Geoffrey Sneddon wrote:

> One possibility would be to just say something like, "conforming
> documents MUST be encoded as UTF-8 and declare themselves to be so".
>
> The biggest problem I see with that is that from an RFC 2119 point
> of view using UTF-8 isn't required for interoperability (for a
> start, UAs are required to support Windows-1252 as well).

That's one problem. There are three other issues with making
non-UTF-8 encodings non-conforming:

1) Making UTF-16 non-conforming would open up a rathole about whether
UTF-16 is "more efficient" for "some languages", ignoring the
proportion of Basic Latin-range characters in markup.

2) Making GB18030 non-conforming would open up another rathole we'd
be better off not venturing into.

3) Making validators whine about non-UTF-8 encodings any more than
Validator.nu does now would likely make HTML5 validation so annoying
for authors who are upgrading existing sites as to make them ignore
validation, thereby rendering the requirement irrelevant.

Validator.nu warns about 'obscure' encodings, where 'obscure' is
defined as not widely supported, based on my investigation of
Firefox, IE, Opera, Safari, Sun Java and Python. The encodings that
are not 'obscure' are listed in the Encoding popup at
http://validator.nu/.

I'd be OK with upgrading the obscure-encoding warnings to errors if
the HTML WG really wants to get deeper into blessing and shunning
encodings. However, given that UAs will have to support the
non-obscure encodings far into the future, making them errors would
be unproductive despite those encodings' adverse effects on form
submission and the query parts of URLs (sketched below).
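To illustrate that last point: in a minimal sketch (the /search
action and the field name "q" are hypothetical), browsers typically
percent-encode non-ASCII form values in the page's encoding, so the
same input yields different query strings depending on the declared
encoding:

  <!-- On a page declared as windows-1252: -->
  <!DOCTYPE html>
  <meta charset="windows-1252">
  <form action="/search" method="get">
    <input name="q" value="ä">
    <input type="submit">
  </form>
  <!-- Submitting typically navigates to /search?q=%E4
       (the single windows-1252 byte for "ä"). -->

  <!-- The same form on a UTF-8 page: -->
  <!-- <meta charset="utf-8"> ... -->
  <!-- Submitting typically navigates to /search?q=%C3%A4
       (the two UTF-8 bytes for "ä"). -->

A server that assumes one encoding for incoming query strings will
therefore misdecode submissions coming from pages in the other
encoding.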
--
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Wednesday, 3 June 2009 07:24:53 UTC