Re: Auto-detect and encodings in HTML5 from Ian Hickson on 2009-06-11 (public-html@w3.org from June 2009)

From: Ian Hickson <ian@hixie.ch>
Date: Thu, 11 Jun 2009 23:16:38 +0000 (UTC)
To: public-html@w3.org
Message-ID: <Pine.LNX.4.62.0906112139310.1648@hixie.dreamhostps.com>

On Tue, 26 May 2009, Travis Leithead wrote:
> 
> The proposal is straight-forward. Only in pages with the HTML5 doctype:
> 
> 1.  Forbid the use of auto-detect heuristics for HTML encodings.

There is already an opt-in for disabling the heuristics: setting the 
encoding explicitly. We can't use the DOCTYPE to decide whether to use the 
heuristics or not, though, since we need to pick an encoding before the 
DOCTYPE is parsed.
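
To make the ordering problem concrete, here is a minimal sketch in Python. 
The function name, its parameters, the precedence shown, and the 
windows-1252 fallback are all assumptions for illustration, not text from 
the spec. The point is only that every input is either transport-level 
information or raw bytes; nothing here can depend on a parsed DOCTYPE.

```python
import codecs

# Hypothetical table of byte order marks; treated here as declarations.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def pick_encoding(raw, http_charset=None, meta_charset=None,
                  fallback="windows-1252"):
    """Return (encoding, source) without tokenizing any markup.

    http_charset: charset parameter from Content-Type, if any.
    meta_charset: result of a byte-level <meta> prescan, if any.
    fallback:     assumed locale default, used only when nothing is declared.
    """
    if http_charset:
        return http_charset.lower(), "http"
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name, "bom"
    if meta_charset:
        return meta_charset.lower(), "meta"
    # Only an undeclared document reaches heuristics or a default.
    return fallback, "default"
```

An explicit declaration at any of the first three steps means the last 
step, where heuristics would live, is never reached.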


> 2.  Forbid the use of problematic encodings such as UTF7 and EBCDIC.
>     Essentially, get rid of the classes of encodings in which
>     Jscript and tags do not correspond to simple ASCII characters
>     in the raw byte stream.

These are already forbidden. (Well, UTF-7 is. EBCDIC is discouraged.)


> 3.  Only handling the encoding in the first META tag within the
>     HEAD and requiring the HEAD and META tags to appear within
>     a well-defined, fixed byte distance into the file to take effect.

In practice this isn't compatible with legacy content. There is already an 
opt-in mechanism to avoid looking for a <meta> beyond the <head>, namely, 
explicitly setting an encoding in the <head>. So I do not think it is 
necessary to add a new one.


> 4.  Require the default HTML encoding to be UTF8.

There is already a way to opt-in to UTF-8, namely, explicitly setting the 
encoding to UTF-8. We wouldn't want to introduce a new one because this 
would mean legacy UAs could detect a different encoding.


> B.  HTML5 would be able to process markup more efficiently by
>     reducing the scanning and computation required to merely
>     determine the encoding of the file.

As far as I can tell, in practice what HTML5 requires is pretty light and 
wouldn't be substantially affected by the proposed changes.


> C.  Since sometimes the heuristics or default encoding uses
>     information about the user's environment, we often see pages
>     that display quite differently from one region to another.
>     As much as possible, browsing from across the globe should
>     give a consistent experience for a given page.  (Basically, I
>     want my children to one day stop seeing garbage when they
>     browse Japanese web sites from the US.)

Agreed.


> D.  We'd greatly increase the consistency of implementation of
>     markup handling by the various user agents. These openings
>     for UA-specific heuristics and decisions undermine the
>     benefits of standards and standardization.

Indeed. This is why HTML5 makes the heuristics step the last-resort measure, 
and why it is non-conforming to write HTML documents that trigger it.


On Wed, 27 May 2009, Henri Sivonen wrote:
>
> I support making the number of bytes that the prescan applies to a fixed 
> number. I think the number should not be smaller than 512 bytes and not 
> be larger than 1024 bytes.

The main reason the spec allows the prescan to be skipped is that it is a 
performance bottleneck in cases where only n-1 bytes have been received, 
and it doesn't substantially affect the user experience except in the case 
of the declaration coming after a script.
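
A rough sketch of the trade-off, in Python. This is a simplified regex 
scan, not the spec's prescan algorithm, and the function name, the status 
strings, and the 1024-byte default are assumptions. With a mandatory fixed 
window, a UA holding one byte fewer than the limit has to wait:

```python
import re

# Simplified pattern: a charset name inside a <meta ...> start tag, followed
# by a delimiter so a name cut off at the end of the buffer does not match.
META_CHARSET = re.compile(
    rb"""<meta[^>]*charset\s*=\s*["']?([A-Za-z0-9_.:-]+)(?=["'\s/>])""",
    re.IGNORECASE,
)


def bounded_prescan(received, limit=1024, stream_closed=False):
    """Return (encoding_or_None, status).

    status is "found", "wait" (window not yet full, more bytes may come),
    or "give-up" (window exhausted or stream ended with no declaration).
    """
    match = META_CHARSET.search(received[:limit])
    if match:
        return match.group(1).decode("ascii").lower(), "found"
    if len(received) < limit and not stream_closed:
        return None, "wait"
    return None, "give-up"
```

For example, bounded_prescan(b"x" * 1023) reports "wait", while the same 
call with 1024 bytes reports "give-up". Allowing the prescan to be skipped 
lets a UA start parsing in the first case instead of stalling.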


On Mon, 1 Jun 2009, Maciej Stachowiak wrote:
> 
> Agreed. I have no problem with authoring tools or servers producing UTF-8 by
> default, as long as they explicitly flag it. In fact, HTML tooling defaulting
> to UTF-8 would be great! But as I understand it, the proposal on the table was
> to change the behavior of HTML consumers, and that I would object to.

On Wed, 3 Jun 2009, Henri Sivonen wrote:
> 
> *Of course* authoring tools
> should use UTF-8 *and declare it* for any new documents.
> 
> HTML5 already says: "Authors are encouraged to use UTF-8."
> http://www.whatwg.org/specs/web-apps/current-work/#charset

I could make this stronger if people think that would be helpful.


On Tue, 2 Jun 2009, Ira McDonald wrote:
> 
> I suggest that claiming conformance to HTML5 means that you MUST
> always supply an explicit charset declaration on the Content-Type
> line - no confusion at all for older browsers and content management
> systems.

With the exception of US-ASCII pages, this is already the
case. Providing an encoding declaration for US-ASCII is unnecessary
because if a document's encoding isn't US-ASCII-compatible, the markup
can't be parsed anyway without knowing the encoding. In other words,
all the possible encodings when the encoding declaration is omitted
are US-ASCII compatible. (For the purposes of this discussion, I'm
treating BOMs as encoding declarations.)


On Wed, 3 Jun 2009, Henri Sivonen wrote:
> 
> My counter-argument is that it's useful for a validator to whine in
> the ASCII-only case, because the validator user may be testing a CMS
> template that is ASCII-only at the time of testing but gets filled
> with arbitrary content at deployment time.

I think it is reasonable for a validator to warn if a document is
US-ASCII without a declaration.
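
As a minimal sketch of what such a check might look like (Python; the 
function name and the "ok"/"warn"/"error" labels are invented for 
illustration, and BOMs are treated as declarations as above):

```python
import codecs

BOM_PREFIXES = (codecs.BOM_UTF8, codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def check_declaration(raw, declared_encoding=None):
    """Classify a document by whether its encoding is declared.

    "ok":    an encoding is declared (explicitly or via a BOM).
    "warn":  no declaration, but every byte is US-ASCII.
    "error": no declaration and at least one non-ASCII byte.
    """
    if declared_encoding or raw.startswith(BOM_PREFIXES):
        return "ok"
    if all(byte < 0x80 for byte in raw):
        return "warn"
    return "error"
```

A CMS template that is pure ASCII today would get "warn" here, which is 
the case Henri describes.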

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Thursday, 11 June 2009 23:17:14 UTC