[whatwg] Character-encoding-related threads

On Mon, 13 Feb 2012, Simon Pieters wrote:
> On Sat, 11 Feb 2012 00:44:22 +0100, Ian Hickson <ian at hixie.ch> wrote:
> > On Wed, 7 Dec 2011, Henri Sivonen wrote:
> > > 
> > > I believe I was implementing exactly what the spec said at the time 
> > > I implemented that behavior of Validator.nu. I'm particularly 
> > > convinced that I was following the spec, because I think it's not 
> > > the optimal behavior. I think pages that don't declare their 
> > > encoding should always be non-conforming even if they only contain 
> > > ASCII bytes, because that way templates created by English-oriented 
> > > (or lorem-ipsum-oriented) authors would be caught as non-conforming 
> > > before non-ASCII text gets filled into them later. Hixie disagreed.
> > 
> > I think it puts an undue burden on authors who are just writing small 
> > files with only ASCII. 7-bit clean ASCII is still the second-most used 
> > encoding on the Web (after UTF-8), so I don't think it's a small 
> > thing.
> > 
> > http://googleblog.blogspot.com/2012/02/unicode-over-60-percent-of-web.html
> 
> I think this is like saying that requiring <!DOCTYPE HTML> is an undue 
> burden on authors...

It is. You may recall we tried really hard to make it shorter. At the end 
of the day, however, "<!DOCTYPE HTML>" is the best we could do.

> ...on authors who are just writing small files that don't use CSS or 
> happen to not be affected by any quirk.

If you have data showing that this would be as many documents as the 
ASCII-only documents, then it would be worth considering. In practice 
though I think it would be a very small group of pages, far fewer than 
the double-digit percentages using ASCII.

> In practice, authors who don't declare their encoding can silence the 
> validator by using entities for their non-ASCII characters, but they 
> will still get bitten by encoding problems as soon as they want to 
> submit forms or resolve URLs with %-escaped stuff in the query 
> component, and so forth, so it seems to me authors would be better off 
> if we said that the encoding cruft is required cruft just like the 
> doctype cruft.

Hm, that's an interesting point. Can we make a list of features that rely 
on the character encoding and have the spec require an encoding if any of 
those are used?

If the list is long or includes anything that it's unreasonable to expect 
will not be used in most Web pages, then we should remove this particular 
"hole" in the conformance criteria.
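
To make the quoted point about %-escaped query components concrete, here 
is a minimal sketch in Python (not anything the spec or Validator.nu 
defines). It assumes, purely for illustration, that a page's text ends 
up percent-escaped as bytes in the document's encoding; the helper name 
escape_query_value and the sample string are hypothetical. It also shows 
that markup using an entity stays 7-bit clean, which is why such a page 
would pass an ASCII-only check while still depending on the encoding.

```python
from urllib.parse import quote, unquote


def escape_query_value(text: str, document_encoding: str) -> str:
    """Percent-escape text as bytes in the given encoding.

    Hypothetical helper for illustration only; it stands in for whatever
    a user agent does when it builds a query string from page content.
    """
    return quote(text, safe="", encoding=document_encoding)


value = "caf\u00e9"  # "café", written with an escape so this file is ASCII

# The same character yields different bytes, hence different escapes,
# depending on which encoding the document is taken to be in.
as_utf8 = escape_query_value(value, "utf-8")
as_1252 = escape_query_value(value, "windows-1252")
print(as_utf8)
print(as_1252)

# Markup that spells the character as an entity contains only ASCII bytes,
# so an "ASCII-only" conformance check would not flag it.
markup = "caf&eacute;"
is_ascii = markup.isascii()
print(is_ascii)

# A receiver that guesses the wrong encoding decodes the escapes incorrectly.
misread = unquote(as_utf8, encoding="windows-1252")
print(misread)
```

The first two results differ even though the source text is identical, 
which is the sense in which an undeclared encoding can still bite a page 
whose markup is pure ASCII.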

Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Monday, 13 February 2012 09:22:13 UTC