
RE: HTML 5 defaults to Windows-1252, where charmod requires UTF-8/UTF-16

From: Dan Connolly <connolly@w3.org>
Date: Mon, 29 Oct 2007 12:51:17 -0500
To: Richard Ishida <ishida@w3.org>, public-i18n-core@w3.org
Cc: 'www-archive' <www-archive@w3.org>, 'Chris Wilson' <Chris.Wilson@microsoft.com>
Message-Id: <1193680277.6433.692.camel@pav>


On Mon, 2007-10-29 at 17:42 +0000, Richard Ishida wrote:
> Hi Dan,
> 
> Please send questions like this to public-i18n-core list, so that the i18n
> WG can reply.

OK. done.

> It's not clear to me from a quick look that there's a conflict.  CharMod
> says that you must define one or both of UTF-8 and UTF-16 as *a default*,
> and HTML5 is defining a minimum set of encodings that must be supported,
> rather than a default (as I read it).  CharMod doesn't proscribe recognition
> of other encodings.
> 
> I think the appropriate charmod criterion for the html5 text in section
> 8.2.2.2 is http://www.w3.org/TR/charmod/#C026 "If the unique encoding
> approach is not chosen, specifications MUST designate at least one of the
> UTF-8 and UTF-16 encoding forms of Unicode as admissible character encodings
> and SHOULD choose at least one of UTF-8 or UTF-16 as required encoding forms
> (encoding forms that MUST be supported by implementations of the
> specification)." - which I think section 8.2.2.2 of html5 supports.
> 
> From my reading, the 'defaults to win1252' bit comes only if the user
> specifies that a page is in ISO latin1, i.e. it assumes that people don't
> know the difference between those two.  It's not a general default.  I don't
> see where html5 specifies what to default to if the encoding is completely
> unknown.

I suppose it's in 8.2.2.1. Determining the character encoding:

"Otherwise, return an implementation-defined or user-specified default
character encoding, with the confidence tentative. Due to its use in
legacy content, windows-1252 is recommended as a default in
predominantly Western demographics. In non-legacy environments, the more
comprehensive UTF-8 encoding is recommended instead. Since these
encodings can in many cases be distinguished by inspection, a user agent
may heuristically decide which to use as a default."
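
The "distinguished by inspection" heuristic in that paragraph could be
sketched roughly as below. This is only an illustration, not anything the
HTML 5 draft prescribes: the function name guess_default_encoding is
hypothetical, and the rule (strict UTF-8 validation, else fall back to
windows-1252) is just one plausible heuristic a user agent might pick.

```python
def guess_default_encoding(data: bytes) -> str:
    """Return a tentative default encoding for bytes with no declared charset.

    Assumption: input that validates as strict UTF-8 is treated as UTF-8;
    anything else falls back to the legacy default, windows-1252.
    """
    try:
        # Strict decoding rejects byte sequences that are not well-formed UTF-8.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "windows-1252"
    return "utf-8"
```

Pure ASCII input is valid in both encodings, so the sketch reports it as
UTF-8; the result is tentative in either case.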

> According to charmod, this is when you should choose UTF-8 or
> UTF-16.  (There may be something about that later in html5.)
> 
> Does that make sense?

I suppose so; I'm happy with any conclusion that says I don't
need to do more work. ;-)

> Cheers,
> RI
> 
> ============
> Richard Ishida
> Internationalization Lead
> W3C (World Wide Web Consortium)
>  
> http://www.w3.org/International/
> http://rishida.net/blog/
> http://rishida.net/
> 
>  
> 
> 
> > -----Original Message-----
> > From: Dan Connolly [mailto:connolly@w3.org] 
> > Sent: 29 October 2007 17:22
> > To: Richard Ishida
> > Cc: www-archive; Chris Wilson
> > Subject: HTML 5 defaults to Windows-1252, where charmod requires UTF-8/UTF-16
> > 
> > Richard,
> > 
> > These conflict:
> > 
> > "C027   [S]  Specifications that require a default encoding MUST define
> > either UTF-8 or UTF-16 as the default, or both if they define suitable
> > means of distinguishing them."
> >  -- http://www.w3.org/TR/charmod/#C027
> > 
> > "User agents must at a minimum support the UTF-8 and Windows-1252
> > encodings, but may support more."
> >  -- 8.2.2.2. Character encoding requirements, http://www.w3.org/html/wg/html5/
> > 
> > I don't think that aspect of the HTML 5 spec is going to change; it's
> > already ubiquitously deployed:
> > 
> >  "Many web browsers treat the MIME charset ISO-8859-1 as Windows-1252"
> > -- http://en.wikipedia.org/wiki/Windows-1252 
> > 
> > Any suggestions on what to do about the conflict? It's not clear to me
> > why C027 is a MUST. Which WG(s) should we be talking to?
> > 
> > p.s. note the cc to www-archive; i.e. feel free to copy/cite/forward
> > anywhere.
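
The deployed behaviour cited in the quoted message (a declared ISO-8859-1
label being decoded as Windows-1252) might look something like the sketch
below. The names LABEL_OVERRIDES, effective_encoding and decode_declared are
hypothetical, and the one-entry table is an assumption for illustration, not
a list taken from the HTML 5 draft.

```python
# Hypothetical override table: declared label -> encoding actually used.
LABEL_OVERRIDES = {"iso-8859-1": "windows-1252"}


def effective_encoding(label: str) -> str:
    """Normalize a declared charset label and apply the override table."""
    normalized = label.strip().lower()
    return LABEL_OVERRIDES.get(normalized, normalized)


def decode_declared(data: bytes, label: str) -> str:
    """Decode bytes using the encoding a browser would effectively apply."""
    return data.decode(effective_encoding(label))
```

The difference only shows up for bytes 0x80 to 0x9F, which are control
characters in ISO-8859-1 but mostly printable characters (curly quotes, the
euro sign and so on) in Windows-1252.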

-- 
Dan Connolly, W3C http://www.w3.org/People/Connolly/
gpg D3C2 887B 0F92 6005 C541  0875 0F91 96DE 6E52 C29E
Received on Monday, 29 October 2007 17:50:08 GMT
