Re: Default charset in HTML5

From: Nick <halbtaxabo-temp4@yahoo.com>


>Nick <halbtaxabo-temp4@yahoo.com>, 2017-03-09 13:46 +0000:
>> Archived-At: <http://www.w3.org/mid/53672439.2120582.1489067212112@mail.yahoo.com>
>> >Michael[tm] Smith <mike@w3.org>:
....
>> >And for legacy backward-compat, if a document doesn’t declare an encoding,
>> >then browsers are required to parse it using windows-1252 as the encoding.
>> 
>> Really? Which current standards document says that?

>> https://html.spec.whatwg.org/#determining-the-character-encoding:concept-encoding-confidence-8

>> Otherwise, return an implementation-defined or user-specified default
>> character encoding, with the confidence tentative.
....
>> In other environments, the default encoding is typically dependent on the
>> user's locale (an approximation of the languages, and thus often
>> encodings, of the pages that the user is likely to frequent). The
>> following table gives suggested defaults based on the user's locale, for
>> compatibility with legacy content.

>windows-1252 is the default there for all user locales other than the ones
>explicitly listed. In the context of checking a document with the HTML
>checker there is no user locale to examine, so it uses windows-1252.

>But as you can see from that table, the encoding that browsers will use for a
>document that doesn’t declare an encoding changes based on the user’s locale.
>For example, if the user’s locale is Japanese, browsers will use Shift_JIS.


By quoting extracts from that document out of context, and omitting the history
of this discussion, you've made the document appear to say the exact opposite
of what it actually says.

The starting point of the discussion was that when the validator encounters a
document which does not declare its encoding, but which contains UTF-8-encoded
characters at any point, it arbitrarily assumes windows-1252 and flags those
byte sequences as errors.

What the cited document says - notably the part which you omitted from your
selective quotation - is this:
>User agents must use the following algorithm, called the encoding sniffing algorithm,
>to determine the character encoding to use when decoding a document in the first pass.
>This algorithm takes as input any out-of-band metadata available to the user agent
>(e.g. the Content-Type metadata of the document) and all the bytes available so far,
>and returns a character encoding and a confidence that is either tentative or certain.

(Note that user agents *MUST* use the encoding sniffing algorithm.)

The document does not mandate how much of the input the user agent must "sniff",
but it does say that user agents are "encouraged" to run a prescan algorithm over
the first 1024 bytes. (The file which provoked me to start this discussion was
less than 1024 bytes in total.)
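
To make that concrete, here is a rough Python sketch of the shape of the thing.
It is not the spec's normative step list (it leaves out the user-override,
iframe-parent and history steps, and the prescan here is a crude regex rather
than the spec's byte-by-byte tokenizer), and the function names are mine:

    import re

    def prescan_for_meta_charset(head):
        # Crude stand-in for the spec's prescan: look for a charset
        # declaration within the first 1024 bytes only.
        pattern = rb"""<meta[^>]+charset\s*=\s*["']?([-\w]+)"""
        m = re.search(pattern, head[:1024], re.IGNORECASE)
        return m.group(1).decode("ascii").lower() if m else None

    def sniff_encoding(transport_charset, bytes_so_far):
        # Inputs: any out-of-band metadata (e.g. an HTTP Content-Type
        # charset) and all the bytes available so far.  Output: an
        # encoding plus a confidence of "tentative" or "certain".

        # A byte order mark settles the question outright.
        if bytes_so_far.startswith(b"\xef\xbb\xbf"):
            return "utf-8", "certain"
        if bytes_so_far.startswith(b"\xfe\xff"):
            return "utf-16be", "certain"
        if bytes_so_far.startswith(b"\xff\xfe"):
            return "utf-16le", "certain"

        # Out-of-band metadata, if present, is also treated as certain.
        if transport_charset:
            return transport_charset, "certain"

        # The prescan that user agents are "encouraged" to run.
        declared = prescan_for_meta_charset(bytes_so_far)
        if declared:
            return declared, "tentative"

        # Only when all of the above has failed: an implementation-defined,
        # locale-dependent default, still only "tentative".
        return "windows-1252", "tentative"

Even in this toy version, windows-1252 is the last resort, reached only after
the byte order mark, the out-of-band metadata and the 1024-byte prescan have
all come up empty.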

The document goes on to "note" that:
>The UTF-8 encoding has a highly detectable bit pattern.
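
That "highly detectable bit pattern" is cheap to check for. A rough sketch
(again mine, and ignoring corner cases such as overlong sequences and
surrogates):

    def looks_like_utf8(data):
        # Multi-byte UTF-8 is a lead byte of the form 110xxxxx, 1110xxxx
        # or 11110xxx followed by the right number of continuation bytes
        # of the form 10xxxxxx.  windows-1252 text essentially never
        # matches this pattern by accident.
        i = 0
        while i < len(data):
            lead = data[i]
            if lead < 0x80:                 # 0xxxxxxx: plain ASCII
                needed = 0
            elif 0xC2 <= lead <= 0xDF:      # 110xxxxx: one continuation byte
                needed = 1
            elif 0xE0 <= lead <= 0xEF:      # 1110xxxx: two continuation bytes
                needed = 2
            elif 0xF0 <= lead <= 0xF4:      # 11110xxx: three continuation bytes
                needed = 3
            else:                           # stray continuation or invalid lead
                return False
            for j in range(1, needed + 1):
                if i + j >= len(data) or (data[i + j] & 0xC0) != 0x80:
                    return False
            i += needed + 1
        return True

(In real code you would simply try decoding the bytes as UTF-8 and catch the
failure, which amounts to the same test.)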


So the validator is certainly not doing what the document you cited encourages a
user agent to do. Thank you for finding a document which points that out.

Nick

Received on Thursday, 9 March 2017 22:36:46 UTC