- From: Ian Hickson <ian@hixie.ch>
- Date: Sun, 30 Aug 2009 02:37:13 +0000 (UTC)
- To: "Phillips, Addison" <addison@amazon.com>, Maciej Stachowiak <mjs@apple.com>, Henri Sivonen <hsivonen@iki.fi>, Anne van Kesteren <annevk@opera.com>, Andrew Cunningham <andrewc@vicnet.net.au>, Richard Ishida <ishida@w3.org>
- Cc: "public-html@w3.org" <public-html@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>
On Wed, 19 Aug 2009, Phillips, Addison wrote:
>
> We remain concerned about the text in Step 7 in this section:
>
> http://dev.w3.org/html5/spec/Overview.html#determining-the-character-encoding
>
> Your current text reads:
>
> "Otherwise, return an implementation-defined or user-specified default
> character encoding, with the confidence tentative. In non-legacy
> environments, the more comprehensive UTF-8 encoding is recommended. Due
> to its use in legacy content, windows-1252 is recommended as a default
> in predominantly Western demographics instead. Since these encodings can
> in many cases be distinguished by inspection, a user agent may
> heuristically decide which to use as a default."
>
> Our concerns about this text are:
>
> 1. It isn't clear what constitutes a "legacy" or "non-legacy
> environment".

The Web is a legacy environment. Non-legacy environments are new walled
gardens.

> We think that, for modern implementations, a bare recommendation of
> UTF-8 would be preferable.

Indeed, when legacy concerns do not apply, that's what the spec suggests.

> 2. The sentence starting "Since these encodings can {...} be
> distinguished by inspection" is not really accurate. If the user agent
> has performed the optional step (6), then heuristic detection has
> already been applied and failed. If the user agent has not done step
> (6), then the only reasonable encoding that can reliably be detected
> based solely on bit-pattern is UTF-8.

Good point. I've removed that text.

> 3. We think your intention is to permit the feature most browsers have
> of allowing the user to configure (from a base default) the character
> encoding to use when displaying a given page.

Right, the requirement is to return an "implementation-defined or
_user-specified_ default character encoding" (emphasis added).

> The sentence starting "Due to its use..." mentions "predominantly
> Western demographics", which we find troublesome, especially given that
> it is associated with the keyword "recommended".

Why?

> 4. We suggest adding to step (6) this note:
>
> "Note: The UTF-8 encoding has a highly detectable bit pattern. Documents
> that contain bytes 0x7F which match the UTF-8 pattern are very likely to
> be UTF-8, while documents that do not match it definitely are not. While
> not full autodetection, it may be appropriate for a user-agent to search
> for this common encoding."

I haven't added this, as I don't want this step to turn into a long list
of possible algorithms to use. However, if you have other papers I should
reference in addition to [UNIVCHARDET], I'm happy to add references.

On Thu, 20 Aug 2009, Maciej Stachowiak wrote:
>
> Based on further discussion with you and Henri, I filed the following:
>
> http://www.w3.org/Bugs/Public/show_bug.cgi?id=7380
> "Suggest heuristic detection of UTF-8"
>
> http://www.w3.org/Bugs/Public/show_bug.cgi?id=7381
> "Clarify default encoding wording and add some examples for non-latin
> locales."

Thanks. I will get to these in due course.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Sunday, 30 August 2009 02:35:34 UTC
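
The UTF-8 check proposed in point 4 of the message above can be sketched roughly as follows. This is only an illustration, not text from the spec or from any browser: the function names `classify_utf8` and `guess_encoding` are hypothetical, and the windows-1252 fallback simply mirrors the default the quoted Step 7 recommends for Western locales. It treats bytes above 0x7F that form valid UTF-8 sequences as evidence for UTF-8, and any invalid sequence as proof against it.

```python
def classify_utf8(data: bytes) -> str:
    """Classify bytes by whether they match the UTF-8 bit pattern.

    Hypothetical helper for illustration; returns one of
    "ascii-only", "likely-utf-8", or "not-utf-8".
    """
    if all(b < 0x80 for b in data):
        # Pure ASCII decodes as UTF-8 but gives no evidence either way.
        return "ascii-only"
    try:
        # Strict decoding rejects any byte sequence that is not valid UTF-8.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "not-utf-8"
    return "likely-utf-8"


def guess_encoding(data: bytes, fallback: str = "windows-1252") -> str:
    """Pick UTF-8 only when there is positive evidence for it.

    The fallback value is an assumption standing in for the
    implementation-defined or user-specified default.
    """
    return "utf-8" if classify_utf8(data) == "likely-utf-8" else fallback


if __name__ == "__main__":
    print(guess_encoding("café".encode("utf-8")))
    print(guess_encoding("café".encode("windows-1252")))
```

A real user agent usually inspects only a prefix of the document, so a multi-byte sequence cut off at the end of that prefix would need separate handling; this sketch assumes the whole byte stream is available.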