[whatwg] Character-encoding-related threads from Ian Hickson on 2012-02-10 (public-whatwg-archive@w3.org from February 2012)

From: Ian Hickson <ian@hixie.ch>
Date: Fri, 10 Feb 2012 23:44:22 +0000 (UTC)
Message-ID: <Pine.LNX.4.64.1202102229120.11170@ps20323.dreamhostps.com>
On Mon, 6 Jun 2011, Boris Zbarsky wrote:
> 
> You can detect other effects by seeing what unescape() does in the 
> resulting document, iirc.

Doesn't seem like it:

   http://junkyard.damowmow.com/499
   http://junkyard.damowmow.com/500

In both cases, unescape() is assuming Win1252, even though in one case 
the encoding is claimed as UTF-8.


> As well as URIs including %-encoded bytes and so forth.

In both cases here, I see URLs getting interpreted as UTF-8, not based on 
the encoding of the containing page:

   http://junkyard.damowmow.com/501
   http://junkyard.damowmow.com/502
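
To make the two readings concrete, here is a minimal Python sketch (an illustration only, not what any browser runs) that decodes the same %-escaped bytes once as UTF-8 and once as Windows-1252. The sample string is made up for the example and is not taken from the tests above.

```python
from urllib.parse import unquote

# Hypothetical sample: the UTF-8 bytes for "é" (0xC3 0xA9), %-escaped.
escaped = "caf%C3%A9"

# Read the escaped bytes as UTF-8, as the URL tests above suggest.
as_utf8 = unquote(escaped, encoding="utf-8")

# Read the same bytes as Windows-1252, one character per byte.
as_win1252 = unquote(escaped, encoding="windows-1252")

print(as_utf8)     # café
print(as_win1252)  # cafÃ©
```

The second result is the familiar mojibake you get when UTF-8 bytes are read as a single-byte Western encoding.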


> Also you can detect what charset is used for stylesheets included by the 
> document that don't declare their own charset.

My head hurt too much from setting up the previous two tests to actually 
test this.


> There are probably other places that use the document encoding.  Worth 
> testing some of this stuff....

I'm happy to consider specific tests. Currently however, it seems like 
Firefox is the only one with any kind of magic involved in determining the 
encoding of javascript: URLs at all, and that magic doesn't seem to have 
as many side effects as one would expect, so I've left it as is.


On Wed, 30 Nov 2011, Faruk Ates wrote:
>
> My understanding is that all browsers default to Western Latin 
> (ISO-8859-1) encoding (for Western-world downloads/OSes) due 
> to legacy content on the web. But how relevant is that still today? Has 
> any browser done any recent research into the need for this?
> 
> I'm wondering if it might not be good to start encouraging defaulting to 
> UTF-8, and only fallback to Western Latin if it is detected that the 
> content is very old / served by old infrastructure or servers, etc. And 
> of course if the content is served with an explicit encoding of Western 
> Latin.

That is in fact exactly what the spec requires. The way that we detect 
that the content is "very old / served by old infrastructure" is that it 
lacks a character encoding declaration... :-)


On Wed, 30 Nov 2011, L. David Baron wrote:
> 
> I would, however, like to see movement towards defaulting to UTF-8: the 
> current situation makes the Web less world-wide because pages that work 
> for one user don't work for another.
> 
> I'm just not quite sure how to get from here to there, though, since 
> such changes are likely to make users experience broken content.

One of the ways I have personally been pushing UTF-8 in the specs is by 
making new formats only support UTF-8.


On Thu, 1 Dec 2011, Sergiusz Wolicki wrote:
>
> I have read section 4.2.5.5 of the WHATWG HTML spec and I think it is 
> sufficient.  It requires that any non-US-ASCII document has an explicit 
> character encoding declaration. It also recommends UTF-8 for all new 
> documents and for authoring tools' default encoding.  Therefore, any 
> document conforming to HTML5 should not pose any problem in this area.
> 
> The default encoding issue is therefore for old stuff.  But I have seen 
> a lot of pages, in browsers and in mail, that were tagged with one 
> encoding and encoded in another.  Hence, documents without a charset 
> declaration are only one of the reasons of garbage we see. Therefore, I 
> see no point in trying to fix anything in browsers by changing the 
> ancient defaults (risking compatibility issues). Energy should go into 
> filing bugs against misbehaving authoring tools and into adding proper 
> recommendations and education in HTML guidelines and tutorials.

Indeed.


On Fri, 2 Dec 2011, Henri Sivonen wrote:
> On Thu, Dec 1, 2011 at 8:29 PM, Brett Zamir <brettz9 at yahoo.com> wrote:
> > How about a "Compatibility Mode" for the older non-UTF-8 character set 
> > approach, specific to page?
> 
> That compatibility mode already exists: It's the default mode--just like 
> the quirks mode is the default for pages that don't have a doctype. You 
> opt out of the quirks mode by saying <!DOCTYPE html>. You opt out of the 
> encoding compatibility mode by saying <meta charset=utf-8>.

Quite.


On Mon, 5 Dec 2011, Darin Adler wrote:
> On Dec 5, 2011, at 4:10 PM, Kornel Lesiński wrote:
> > 
> > Could <!DOCTYPE html> be an opt-in to default UTF-8 encoding?
> > 
> > It would be nice to minimize number of declarations a page needs to 
> > include.
> 
> I like that idea. Maybe it's not too late.

Just configure your server to send back UTF-8 character encoding 
declarations by default, and you don't need to think about it.
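
How you do that depends on your server, so the following is only a sketch of the idea in Python: a hypothetical helper, with_default_charset(), that adds a UTF-8 charset parameter to any text/* Content-Type that doesn't already carry one. Real servers have their own configuration directives for this; the helper name and its behaviour are assumptions for illustration.

```python
def with_default_charset(content_type: str, default: str = "utf-8") -> str:
    """Append a charset parameter to a text/* Content-Type that lacks one."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    # Look for an existing charset parameter among the ";"-separated parts.
    has_charset = any(
        param.strip().lower().startswith("charset=")
        for param in content_type.split(";")[1:]
    )
    if media_type.startswith("text/") and not has_charset:
        return f"{content_type.strip()}; charset={default}"
    return content_type


print(with_default_charset("text/html"))
print(with_default_charset("text/html; charset=windows-1252"))
print(with_default_charset("image/png"))
```

An explicit charset is left alone, so pages that really are in a legacy encoding can still say so.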


On Wed, 7 Dec 2011, Henri Sivonen wrote:
> 
> If you want to minimize the declarations, you can put the UTF-8 BOM 
> followed by <!DOCTYPE html> at the start of the file.

That is indeed another terse solution.
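
For what it's worth, producing such a file is straightforward in most environments. A minimal Python sketch, assuming the page is held in a string (the with_bom() helper and the sample markup are made up for illustration):

```python
import codecs


def with_bom(html: str) -> bytes:
    """Encode a page as UTF-8 with a leading byte order mark (EF BB BF)."""
    # The "utf-8-sig" codec writes the BOM before the encoded text.
    return html.encode("utf-8-sig")


page = with_bom("<!DOCTYPE html>\n<title>Example</title>\n<p>caf\u00e9</p>\n")

# The BOM comes first, immediately followed by the doctype.
print(page.startswith(codecs.BOM_UTF8))
print(page[len(codecs.BOM_UTF8):].startswith(b"<!DOCTYPE html>"))
```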


On Mon, 5 Dec 2011, Sergiusz Wolicki wrote:
> 
> As far as I understand, HTML5 defines US-ASCII to be the default and 
> requires that any other encoding is explicitly declared. I do like this 
> approach.

It's important not to confuse the default for authors (which is indeed 
ASCII) with the default for browsers (which is more complicated, but 
which defines the processing for bytes in the range 0x80-0xFF, which 
ASCII leaves undefined). HTML defines both.
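
As a rough illustration of why the two defaults differ, here is a Python sketch. The needs_encoding_declaration() helper is hypothetical and simplifies the author-side rule to "any byte in 0x80-0xFF needs a declaration"; the two decodes stand in for two possible browser-side fallbacks and are not a statement of what any particular browser picks.

```python
def needs_encoding_declaration(data: bytes) -> bool:
    """True if any byte falls in 0x80-0xFF, i.e. outside ASCII."""
    return any(byte >= 0x80 for byte in data)


ascii_page = b"<!DOCTYPE html><title>Hello</title>"
legacy_page = b"<!DOCTYPE html><title>caf\xe9</title>"

print(needs_encoding_declaration(ascii_page))   # False
print(needs_encoding_declaration(legacy_page))  # True

# The same high byte is read differently under different fallbacks.
as_win1252 = legacy_page.decode("windows-1252")
as_utf8 = legacy_page.decode("utf-8", errors="replace")
print(as_win1252)
print(as_utf8)
```

A pure-ASCII page decodes identically either way, which is why the author-side default can be ASCII while the browser-side default still has to say something about the high bytes.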


On Wed, 7 Dec 2011, Henri Sivonen wrote:
> 
> I believe I was implementing exactly what the spec said at the time I 
> implemented that behavior of Validator.nu. I'm particularly convinced 
> that I was following the spec, because I think it's not the optimal 
> behavior. I think pages that don't declare their encoding should always 
> be non-conforming even if they only contain ASCII bytes, because that 
> way templates created by English-oriented (or lorem ipsum -oriented) 
> authors would be caught as non-conforming before non-ASCII text gets 
> filled into them later. Hixie disagreed.

I think it puts an undue burden on authors who are just writing small 
files with only ASCII. 7-bit clean ASCII is still the second-most used 
encoding on the Web (after UTF-8), so I don't think it's a small thing.

http://googleblog.blogspot.com/2012/02/unicode-over-60-percent-of-web.html


On Mon, 19 Dec 2011, Henri Sivonen wrote:
> 
> Hmm. The HTML spec isn't too clear about when alias resolution happens, 
> so I (incorrectly, I now think) mapped only "UTF-16", "UTF-16BE" and 
> "UTF-16LE" (ASCII-case-insensitive) to UTF-8 in meta without considering 
> aliases at that point. Hixie, was alias resolution supposed to happen 
> first? In Firefox, alias resolution happens after, so <meta 
> charset=iso-10646-ucs-2> is ignored per the non-ASCII superset rule.

Assuming you mean for cases where the spec says things like "If encoding 
is a UTF-16 encoding, then change the value of encoding to UTF-8", then 
any alias of UTF-16, UTF-16LE, and UTF-16BE (there aren't any registered 
currently, but "Unicode" might need to be one) would be considered a 
match. It doesn't matter if I refer to you as Henri or hsivonen, in both 
cases we're talking about the same person.

There are also the overrides. The way the spec is written, the "Character 
encoding overrides" are not aliases (at least not in the character 
encoding registry sense). The spec just requires that when you would 
instantiate a Foo decoder or encoder, you instead use a Bar one.

Currently, "iso-10646-ucs-2" is neither an alias for UTF-16 nor an 
encoding that is overridden in any way. It's its own encoding.
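
A sketch of the comparison I mean, in Python. The alias table here is a toy, not the real registry: as noted above there are no registered aliases for the UTF-16 family, and "unicode" appears only as the hypothetical candidate. All function names are made up for illustration.

```python
# Toy alias table (an assumption for this sketch, NOT the real registry).
ALIASES = {
    "unicode": "utf-16",
}

UTF16_FAMILY = {"utf-16", "utf-16le", "utf-16be"}


def ascii_lower(label: str) -> str:
    """Lowercase A-Z only, for ASCII-case-insensitive matching."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in label)


def canonical_name(label: str) -> str:
    """Resolve aliases first, so every name for an encoding compares equal."""
    name = ascii_lower(label.strip())
    return ALIASES.get(name, name)


def encoding_from_meta(label: str) -> str:
    """Apply the 'UTF-16 encoding becomes UTF-8' rule after alias resolution."""
    name = canonical_name(label)
    if name in UTF16_FAMILY:
        return "utf-8"
    return name


print(encoding_from_meta("UTF-16LE"))         # utf-8
print(encoding_from_meta("Unicode"))          # utf-8 (via the toy alias)
print(encoding_from_meta("iso-10646-ucs-2"))  # iso-10646-ucs-2
```

The last label is returned unchanged because, in this sketch as in the current state of things, it is its own encoding rather than a name for UTF-16.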

Anne's "Encoding" draft may make overrides into aliases. In practice I do 
not believe this will cause any normative change to the behaviour since 
the only time that character encoding identities are compared (as opposed 
to being used, where the overrides kick in) is for UTF-8 and UTF-16 
encodings, and those don't have overrides (and hopefully never will).
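
To illustrate the distinction between comparing an encoding's identity and using it, here is a small Python sketch. The override table is an illustrative subset chosen for the example, and decode_with_overrides() is a hypothetical helper, not anything the spec defines by that name.

```python
# Illustrative subset of an override table (assumed entries for this sketch):
# when a decoder for the key would be instantiated, use the value instead.
OVERRIDES = {
    "iso-8859-1": "windows-1252",
    "us-ascii": "windows-1252",
}


def decode_with_overrides(data: bytes, encoding: str) -> str:
    """Decode using the overriding decoder, leaving the label itself alone."""
    name = encoding.lower()
    return data.decode(OVERRIDES.get(name, name))


declared = "ISO-8859-1"
data = b"\x93quoted\x94"

# Identity: the declared encoding is still ISO-8859-1, not windows-1252.
print(declared.lower() == "windows-1252")  # False

# Use: the bytes are nonetheless decoded with the windows-1252 decoder.
overridden = decode_with_overrides(data, declared)
plain = data.decode("iso-8859-1")
print(overridden == "\u201cquoted\u201d")  # True: curly quotes
print(plain == "\x93quoted\x94")           # True: C1 control characters
```

Turning the override into an alias would change the answer to the identity check but not the decoded text, which is why it should not matter as long as only UTF-8 and UTF-16 are ever compared.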

I hope the above is clear. Let me know if you think the spec is vague on 
the matter.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Friday, 10 February 2012 15:44:22 UTC