- From: Glenn Adams <glenn@skynav.com>
- Date: Mon, 16 Jan 2012 15:09:39 -0700
- To: Boris Zbarsky <bzbarsky@mit.edu>
- Cc: "Kang-Hao (Kenny) Lu" <kennyluck@csail.mit.edu>, WWW Style <www-style@w3.org>
- Message-ID: <CACQ=j+c1O6605WRKAoqPunDrcbnhqQDRRAbE557tc0hz-H_tkg@mail.gmail.com>
On Mon, Jan 16, 2012 at 2:11 PM, Boris Zbarsky <bzbarsky@mit.edu> wrote:

> On 1/16/12 3:54 PM, Glenn Adams wrote:
>
>> Apparently. I view the damage of proliferating non-well formed content,
>> and thus non-interoperable content, and the damage of proliferating
>> potential security holes to be of greater consequence than requiring
>> script authors to address the consequences of working with non-BMP
>> characters encoded in UTF-16.
>
> The problem is that you think the burden of the pain here will fall on
> authors. It won't. It'll fall on users.

If scripts cause exceptions, then users will complain or not use the product (content). If script authors care, they will make their scripts work; if they don't care, I don't know how an implementation is going to fix it for the end users.

> So the question is whether "proliferating non-well formed content"
> (whatever that even means in this context, since the DOM and ECMAScript
> don't really have such concepts) is a worse thing than locking some users
> out of using parts of the web because they happen to communicate in a
> language that can't be written down using the BMP.

ECMA-262 3rd Edition Section 2 Conformance states:

"A conforming implementation of this International standard shall interpret characters in conformance with the Unicode Standard, Version 2.1 or later, and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the adopted encoding form, implementation level 3. If the adopted ISO/IEC 10646-1 subset is not otherwise specified, it is presumed to be the BMP subset, collection 300. If the adopted encoding form is not otherwise specified, it presumed to be the UTF-16 encoding form."

Unicode 2.0 Section 3.1 Clause C4 (which is adopted unchanged in Unicode 2.1) states:

"A process shall not interpret an unpaired high- or low-surrogate as an abstract character."

It is pretty clear to me that a conforming implementation of the above will not (or should not) permit e.textContent="\ud834" to complete without throwing some exception, at least without willfully violating these conformance requirements. If DOM-4 does not make this clear, then perhaps it should.
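
For illustration only, the kind of check under discussion could be sketched roughly as follows. The names hasUnpairedSurrogate and setTextContentStrict are hypothetical and are not part of any DOM or ECMAScript API; the choice of RangeError is likewise an assumption, since the thread does not say which exception would be appropriate. This is a sketch of the strict behavior argued for above, not a description of what any implementation does.

```javascript
// Hypothetical helper (not a DOM or ECMAScript API): returns true if the
// UTF-16 string s contains a high or low surrogate that is not part of a
// well-formed high+low pair.
function hasUnpairedSurrogate(s) {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: the next code unit must be a low surrogate.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip over the second half of the well-formed pair
      } else {
        return true;
      }
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // Low surrogate that was not preceded by a high surrogate.
      return true;
    }
  }
  return false;
}

// Hypothetical strict setter: refuses ill-formed strings instead of
// storing them. The exception type is an assumption for this sketch.
function setTextContentStrict(target, value) {
  if (hasUnpairedSurrogate(value)) {
    throw new RangeError("string contains an unpaired surrogate");
  }
  target.textContent = value;
}
```

Under this sketch, setTextContentStrict(e, "\ud834") would throw, while setTextContentStrict(e, "\ud834\udd1e") (a complete surrogate pair encoding U+1D11E) would be accepted.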
Received on Monday, 16 January 2012 22:10:33 UTC