Re: Auto-detect and encodings in HTML5

On Jun 2, 2009, at 19:17, Leif Halvard Silli wrote:

> Like several others, your reply does not incorporate the authoring
> tools perspective that Larry contributed to this thread[1].


The thread started about the default on the consuming side. *Of
course* authoring tools should use UTF-8 *and declare it* for any new
documents.

HTML5 already says: "Authors are encouraged to use UTF-8." http://www.whatwg.org/specs/web-apps/current-work/#charset
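
For concreteness, a minimal sketch of what an authoring tool could
emit for a new document (illustrative, not normative):

  <!DOCTYPE html>
  <html>
  <head>
  <meta charset="utf-8">
  <title>New document</title>
  </head>
  <body>
  </body>
  </html>

Ideally, the server also sends "Content-Type: text/html;
charset=utf-8", which settles the matter for HTTP consumers.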

On Jun 2, 2009, at 18:54, Ira McDonald wrote:

> I suggest that claiming conformance to HTML5 means that you MUST
> always supply an explicit charset declaration on the Content-Type
> line - no confusion at all for older browsers and content
> management systems.


I've previously argued that conformance should require an explicit
encoding declaration (with the BOM counting as an explicit
declaration). However, my suggestion was rejected because there's no
real interop issue with HTML files that contain only ASCII bytes, and
it would be inconvenient to have to declare the encoding in small
ASCII-only test cases.

My counter-argument is that it's useful for a validator to whine in  
the ASCII-only case, because the validator user may be testing a CMS  
template that is ASCII-only at the time of testing but gets filled  
with arbitrary content at deployment time.
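
To illustrate with a made-up template (the placeholder syntax is
hypothetical):

  <p>Hello, ${user.name}!</p>

This is all-ASCII when tested with placeholder data, but as soon as
a user named "Café René" shows up, the byte stream contains
non-ASCII bytes, and without a declared encoding the rendering is at
the mercy of the consumer's default.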

Anyway, as it stands, HTML5 *requires* the encoding to be declared in  
order for the document to be valid if the byte stream has *non-ASCII*  
bytes. Validator.nu issues an error in that case. Validator.nu issues  
a warning in the ASCII-only case.

Quoting the spec for reference:
> If an HTML document does not start with a BOM, and if its encoding  
> is not explicitly given by Content-Type metadata, then the character  
> encoding used must be an ASCII-compatible character encoding, and,  
> in addition, if that encoding isn't US-ASCII itself, then the  
> encoding must be specified using a meta element with a charset  
> attribute or a meta element in the Encoding declaration state.

http://www.whatwg.org/specs/web-apps/current-work/#charset
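
In markup, the two meta declarations the quoted text refers to look
like this (either form satisfies the requirement):

  <meta charset="utf-8">
  <meta http-equiv="content-type" content="text/html; charset=utf-8">

The latter is the meta element in the Encoding declaration state.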

On Jun 2, 2009, at 20:27, Geoffrey Sneddon wrote:

> One possibility would be to just say something like, "conforming  
> documents MUST be encoded as UTF-8 and declare themselves to be so".
>
> The biggest problem I see with that is that from an RFC 2119 point  
> of view using UTF-8 isn't required for interoperability (for a  
> start, UAs are required to support Windows-1252 as well).

That's one problem. There are three other issues with making
non-UTF-8 encodings non-conforming:

  1) Making UTF-16 non-conforming would open up a rathole about
whether UTF-16 is "more efficient" for "some languages", ignoring the
proportion of Basic Latin-range characters in markup.

  2) Making GB18030 non-conforming would open up another rathole we'd  
be better off not venturing into.

  3) Making validators whine about non-UTF-8 encodings any more than
Validator.nu does now would likely make HTML5 validation so annoying
for authors who are upgrading existing sites that they'd ignore
validation altogether, thereby rendering the requirement irrelevant.

Validator.nu warns about 'obscure' encodings, where 'obscure' means
not widely supported, based on my investigation of Firefox, IE,
Opera, Safari, Sun Java and Python. The encodings that are not
'obscure' are listed in the Encoding popup at http://validator.nu/.

I'd be OK with upgrading the obscure encoding warnings to errors if
the HTML WG really wants to get deeper into blessing and shunning
encodings. However, given that UAs will have to support the
non-obscure encodings far into the future, making those encodings
errors would be unproductive, despite their adverse effects on form
submission and on the query parts of URLs.
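
As an illustration of the form submission point (made-up example): a
form on a windows-1252 page submits "café" as

  ?q=caf%E9

whereas the same form on a UTF-8 page submits

  ?q=caf%C3%A9

so a site that changes encoding has to cope with both byte forms
showing up in its query strings.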

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
