Re: [css3-text] grapheme clusters across element boundary from Glenn Adams on 2012-01-16 (www-style@w3.org from January 2012)

From: Glenn Adams <glenn@skynav.com>
Date: Mon, 16 Jan 2012 15:09:39 -0700
To: Boris Zbarsky <bzbarsky@mit.edu>
Cc: "Kang-Hao (Kenny) Lu" <kennyluck@csail.mit.edu>, WWW Style <www-style@w3.org>
Message-ID: <CACQ=j+c1O6605WRKAoqPunDrcbnhqQDRRAbE557tc0hz-H_tkg@mail.gmail.com>

On Mon, Jan 16, 2012 at 2:11 PM, Boris Zbarsky <bzbarsky@mit.edu> wrote:

> On 1/16/12 3:54 PM, Glenn Adams wrote:
>
>> Apparently. I view the damage of proliferating non-well formed content,
>> and thus non-interoperable content, and the damage of proliferating
>> potential security holes to be of greater consequence than requiring
>> script authors to address the consequences of working with non-BMP
>> characters encoded in UTF-16.
>>
>
> The problem is that you think the burden of the pain here will fall on
> authors.  It won't.  It'll fall on users.
>

If scripts cause exceptions, then users will complain or not use the
product (content). If script authors care, they will make their scripts
work; if they don't care, I don't know how an implementation is going to
fix it for the end users.

> So the question is whether "proliferating non-well formed content"
> (whatever that even means in this context, since the DOM and ECMAScript
> don't really have such concepts) is a worse thing than locking some users
> out of using parts of the web because they happen to communicate in a
> language that can't be written down using the BMP.
>

ECMA-262 3rd Edition Section 2 Conformance states:

"A conforming implementation of this International standard shall interpret
characters in conformance with the Unicode Standard, Version 2.1 or later,
and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the adopted encoding
form, implementation level 3. If the adopted ISO/IEC 10646-1 subset is not
otherwise specified, it is presumed to be the BMP subset, collection 300.
If the adopted encoding form is not otherwise specified, it presumed to be
the UTF-16 encoding form."

Unicode 2.0 Section 3.1 Clause C4 (which is adopted unchanged in Unicode
2.1) states:

"A process shall not interpret an unpaired high- or low-surrogate as an
abstract character."

It is pretty clear to me that a conforming implementation of the above will
not (or should not) permit e.textContent="\ud834" to complete without
throwing some exception, at least without willfully violating these
conformance requirements.
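
As an illustration only, the kind of check implied above could be sketched
in script roughly as follows. The names firstUnpairedSurrogate and
setTextContentStrict are hypothetical helpers invented for this example;
they are not part of DOM or ECMAScript, and the choice of RangeError is an
assumption, since no exception type is specified anywhere for this case.

```javascript
// Hypothetical helper (not a DOM or ECMAScript API): returns the index of
// the first unpaired surrogate code unit in `s`, or -1 if every surrogate
// in `s` is part of a well-formed high/low pair.
function firstUnpairedSurrogate(s) {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: it must be immediately followed by a low surrogate.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // well-formed pair, skip the low half
      } else {
        return i;
      }
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // Low surrogate with no preceding high surrogate.
      return i;
    }
  }
  return -1;
}

// Hypothetical strict setter: refuses to assign text that contains an
// unpaired surrogate. RangeError is an arbitrary choice for this sketch.
function setTextContentStrict(node, text) {
  const index = firstUnpairedSurrogate(text);
  if (index !== -1) {
    throw new RangeError("unpaired surrogate at code unit index " + index);
  }
  node.textContent = text;
}
```

With this sketch, setTextContentStrict(e, "\ud834") throws, while
setTextContentStrict(e, "\ud834\udd1e") (a complete surrogate pair) assigns
normally.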

If DOM-4 does not make this clear, then perhaps it should.

Received on Monday, 16 January 2012 22:10:33 UTC