RE: Auto-detect and encodings in HTML5

- The discussion here is about consuming, not authoring. In particular, it does not set a default encoding for authoring: use whatever encoding you like, but please label it properly. This was the consensus about a dozen years ago, beautifully stated (if I remember correctly) by Duerst, Masinter or Yergeau.

- As already commented, the encoding must be sent in the HTTP header: problem solved.

- Otherwise, there must be a "standard auto-detect algorithm" that always outputs one of the mandatory encodings. The suggestion is that if step N-1 has not found an encoding, step N is encoding=UTF-8.

- Then, one can design the "standard auto-detect algorithm":
  + Reading so many bytes
  + META
  + Etc

- All this should take into account Larry's posting:
  + "reducing ambiguity and making web transactions more reliable"
  + "opposed to making an incompatible change with actual current behavior."
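
To make the order of steps concrete, here is a rough Python sketch of such a fallback chain. It is only an illustration under my own assumptions, not a proposed algorithm: the names detect_encoding, SUPPORTED and PRESCAN_BYTES are hypothetical, the 1024-byte window stands in for "reading so many bytes", and the three labels in SUPPORTED are placeholders for whatever the mandatory encodings turn out to be.

```python
import re

# All names and values below are assumptions made for illustration only.
SUPPORTED = {"utf-8", "iso-8859-1", "windows-1252"}  # placeholder for the "mandatory encodings"
PRESCAN_BYTES = 1024  # placeholder for "reading so many bytes"

# Rough pattern for a charset declared inside a META tag.
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?\s*([A-Za-z0-9._:-]+)", re.IGNORECASE
)
# Rough pattern for the charset parameter of an HTTP Content-Type value.
HEADER_CHARSET = re.compile(r"charset\s*=\s*[\"']?([^\s;\"']+)", re.IGNORECASE)


def normalize(label):
    """Return the label in lower case if it is supported, else None."""
    label = label.strip().lower()
    if label == "utf8":
        label = "utf-8"
    return label if label in SUPPORTED else None


def detect_encoding(body, content_type=None):
    """Return one of SUPPORTED for the given bytes; never returns None."""
    # Step 1: the HTTP header wins when it carries a usable label.
    if content_type:
        match = HEADER_CHARSET.search(content_type)
        if match:
            encoding = normalize(match.group(1))
            if encoding:
                return encoding

    # Step 2: look for META, but only within the first PRESCAN_BYTES bytes.
    match = META_CHARSET.search(body[:PRESCAN_BYTES])
    if match:
        encoding = normalize(match.group(1).decode("ascii"))
        if encoding:
            return encoding

    # (Further heuristics, the "Etc" step, would go here.)

    # Step N: nothing found in steps 1..N-1, so the answer is UTF-8.
    return "utf-8"
```

The point of the sketch is only that the function always returns one of the supported labels, so two consumers running the same steps on the same bytes would agree.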


--- On Tue, 2/6/09, Phillips, Addison <> wrote:

> The problem with making UTF-8 the "last resort" encoding is
> that, ironically, it is possible to detect when something
> isn't UTF-8 and thus know that the encoding selected is
> wrong (this is not true of most encodings). If a document
> really isn't UTF-8, the byte pattern will quite probably
> reveal that fact, although possibly after an inconveniently
> large number of bytes in the document have been read. So to
> make an encoding the "last resort" and presenting data in a
> way known to be flawed seems less than ideal :-(. It might
> be better to offer the user the opportunity to correct the
> encoding, etc., in that case.
> UTF-8 might be a good guess for higher in the encoding
> detection stack, though, and by all means should be the
> "default" (that is, recommended) encoding for authoring Web
> documents. If encoding announcement (via meta or some other
> mechanism) can be required in HTML5, it would also be good
> to make it the default encoding there. 
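
Addison's point that non-UTF-8 input can usually be recognized is easy to illustrate. The following is a minimal Python sketch, assuming the whole byte string is available; the helper name looks_like_utf8 is hypothetical.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8, False otherwise."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


# "café" saved as ISO-8859-1 contains a lone 0xE9 byte, which is not valid UTF-8.
latin1_bytes = "café".encode("iso-8859-1")
utf8_bytes = "café".encode("utf-8")
```

Note that a check like this only says the bytes are not UTF-8; it does not say what they are. Pure ASCII passes, and a buffer cut in the middle of a multi-byte sequence would be reported as invalid, so a consumer would need to check on complete data or handle the cut-off tail separately.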


Received on Tuesday, 2 June 2009 16:33:21 UTC