RE: Auto-detect and encodings in HTML5 from M.T. Carrasco Benitez on 2009-06-02 (public-html@w3.org from June 2009)

From: M.T. Carrasco Benitez <mtcarrascob@yahoo.com>
Date: Tue, 2 Jun 2009 09:23:51 -0700 (PDT)
To: Anne van Kesteren <annevk@opera.com>, Chris Wilson <Chris.Wilson@microsoft.com>, Maciej Stachowiak <mjs@apple.com>, Larry Masinter <masinter@adobe.com>, AddisonPhillips <addison@amazon.com>
Cc: Travis Leithead <Travis.Leithead@microsoft.com>, Erik van der Poel <erikv@google.com>, "public-html@w3.org" <public-html@w3.org>, "www-international@w3.org" <www-international@w3.org>, Richard Ishida <ishida@w3.org>, Ian Hickson <ian@hixie.ch>, Harley Rosnow <Harley.Rosnow@microsoft.com>
Message-ID: <722842.79848.qm@web32404.mail.mud.yahoo.com>

- The discussion here is about consuming documents, not authoring them. In particular, there is no default encoding for authoring: use whatever encoding you like, but please label it properly. This was the consensus about a dozen years ago, beautifully posted (if I remember correctly) by Duerst, Masinter or Yergeau.

- As already commented, the encoding must be sent in the HTTP header: problem solved.

- Otherwise, there must be a "standard auto-detect algorithm" that always outputs one of the mandatory encodings. The suggestion is that if step N-1 has not found an encoding, step N is encoding=UTF-8.

- Then, one can design the "standard auto-detect algorithm":
  + Reading so many bytes
  + META
  + Etc

- All this should take into account Larry's posting:
  + "reducing ambiguity and making web transactions more reliable"
  + "opposed to making an incompatible change with actual current behavior."
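
To make the shape of such an algorithm concrete, here is a rough sketch in Python. It is only an illustration of the cascade described above (HTTP header first, then a look at the first bytes, then META, then UTF-8 as the last step), not the HTML5 algorithm. The function name detect_encoding, the 1024-byte prescan limit, the byte-order-mark check standing in for "Etc", and the simplified META pattern are all my assumptions.

```python
import re

# Hypothetical prescan limit; the thread does not say how many bytes to read.
PRESCAN_BYTES = 1024

# Simplified pattern: finds charset=... inside a <meta ...> tag.
# Real markup needs a proper tokenizer; this is only a sketch.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?\s*([A-Za-z0-9._:-]+)',
    re.IGNORECASE,
)

def detect_encoding(http_charset, body, prescan_bytes=PRESCAN_BYTES):
    """Return an encoding label; never returns None."""
    # Step 1: the HTTP header wins when it is present.
    if http_charset:
        return http_charset.strip().lower()

    head = body[:prescan_bytes]

    # Step 2 (assumed "Etc" step): byte order marks.
    if head.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if head.startswith(b"\xff\xfe"):
        return "utf-16le"
    if head.startswith(b"\xfe\xff"):
        return "utf-16be"

    # Step 3: META declaration within the prescanned bytes.
    match = META_CHARSET.search(head)
    if match:
        return match.group(1).decode("ascii").lower()

    # Step N: nothing found in steps 1..N-1, so fall back to UTF-8.
    return "utf-8"
```

The point of the sketch is only that every path returns a label, so two consumers running the same steps on the same bytes agree on the result.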

Tomas


--- On Tue, 2/6/09, Phillips, Addison <addison@amazon.com> wrote:

> The problem with making UTF-8 the "last resort" encoding is
> that, ironically, it is possible to detect when something
> isn't UTF-8 and thus know that the encoding selected is
> wrong (this is not true of most encodings). If a document
> really isn't UTF-8, the byte pattern will quite probably
> reveal that fact, although possibly after an inconveniently
> large number of bytes in the document have been read. So to
> make an encoding the "last resort" and presenting data in a
> way known to be flawed seems less than ideal :-(. It might
> be better to offer the user the opportunity to correct the
> encoding, etc., in that case.
> 
> UTF-8 might be a good guess for higher in the encoding
> detection stack, though, and by all means should be the
> "default" (that is, recommended) encoding for authoring Web
> documents. If encoding announcement (via meta or some other
> mechanism) can be required in HTML5, it would also be good
> to make it the default encoding there.
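
The observation quoted above, that data which is not UTF-8 usually gives itself away, can be illustrated with a strict decode in Python. This is a minimal sketch; the helper name looks_like_utf8 is hypothetical, and it assumes the whole buffer is available (a buffer cut in the middle of a multi-byte sequence would also be reported as invalid).

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8, False otherwise."""
    try:
        # Strict decoding raises on any byte sequence that is not valid UTF-8.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True

# The same text in two encodings: only the UTF-8 bytes pass the check.
utf8_bytes = "café".encode("utf-8")
latin1_bytes = "café".encode("latin-1")
print(looks_like_utf8(utf8_bytes), looks_like_utf8(latin1_bytes))
```

Pure ASCII passes as well, since ASCII is a subset of UTF-8, so a passing check narrows the possibilities but does not prove the author intended UTF-8.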

Received on Tuesday, 2 June 2009 16:33:21 UTC