Re: [css3-text] grapheme clusters across element boundary from Glenn Adams on 2012-01-16 (www-style@w3.org from January 2012)

From: Glenn Adams <glenn@skynav.com>
Date: Mon, 16 Jan 2012 13:54:21 -0700
To: Boris Zbarsky <bzbarsky@mit.edu>
Cc: "Kang-Hao (Kenny) Lu" <kennyluck@csail.mit.edu>, WWW Style <www-style@w3.org>
Message-ID: <CACQ=j+e+2sAQUG5oRC2Ta4pS3J+BF63gBZnqVU3F5froSJj=5w@mail.gmail.com>

On Mon, Jan 16, 2012 at 1:39 PM, Boris Zbarsky <bzbarsky@mit.edu> wrote:

> On 1/16/12 3:06 PM, Glenn Adams wrote:
>
>> (2) script that naively assumes codepoint = character, and inadvertently
>> separates surrogate pair elements;
>>
>
> This is most script.
>
>> regarding (2), my position is that the implementation should be
>> conservative and not liberal when allowing script to set certain
>> property values whose underlying semantics imply a well-formed UTF-16
>> string; so, yes, were I implementing this, I would throw an exception
>> when a script attempts to set a DOMString typed property to a JS String
>> that contains an isolated surrogate codepoint; or at least I would do
>> this by default, and only depart from this default in certain
>> circumscribed cases;
>>
>
> And my point is that since pretty much every script handles surrogate
> pairs wrong, throwing would just penalize users who try to use non-BMP
> characters with such scripts.  It would particularly penalize users whose
> languages are written with non-BMP characters.
>
> Maybe you think it's OK to screw such users over.  I don't.

Boris, why do you use language like this? It is not conducive to a
technical dialog. As one of the authors of Unicode, I find it rather ironic
to be accused in this manner.

> Especially in situations in which the "correct" rendering is obvious (e.g.
> every single codepoint wrapped in its own span, but all have the same
> style: you just render the text as a single text string with that style).

>> this is my answer to your question "what the best way to limit damage
>> from the lack of understanding on script authors' part is"
>>
>
> I think you and I have different definitions of "damage" here.

Apparently. I view the damage of proliferating non-well-formed content, and
thus non-interoperable content, and the damage of proliferating potential
security holes to be of greater consequence than the cost of requiring
script authors to handle non-BMP characters encoded in UTF-16 correctly. In
other words, I conclude just the opposite: users of non-BMP characters are
penalized *more* by implementations that munge surrogate pairs than by
implementations that enforce correct handling.
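
To make the conservative behavior concrete, here is a minimal sketch in TypeScript of the kind of check I have in mind. It is an illustration only, not any implementation's actual code: the names findLoneSurrogate and assertWellFormed are hypothetical, and raising a RangeError is an assumption about how the failure might be reported.

```typescript
// Hypothetical helper: returns the index of the first isolated surrogate
// code unit in s, or -1 if s is a well-formed UTF-16 string.
function findLoneSurrogate(s: string): number {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: it must be followed by a low surrogate.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // well-formed pair, skip its second half
      } else {
        return i;
      }
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Low surrogate with no preceding high surrogate.
      return i;
    }
  }
  return -1;
}

// Hypothetical setter guard: reject the value instead of storing it.
// RangeError is an assumed choice of exception type.
function assertWellFormed(s: string): string {
  const at = findLoneSurrogate(s);
  if (at !== -1) {
    throw new RangeError("isolated surrogate at index " + at);
  }
  return s;
}

// U+10330 GOTHIC LETTER AHSA, encoded as the surrogate pair D800 DF30.
const ahsa = "\uD800\uDF30";

// A script that assumes one code unit = one character splits the pair.
const naiveFirstChar = ahsa.substring(0, 1);
```

Here assertWellFormed(ahsa) returns the string unchanged, while assertWellFormed(naiveFirstChar) throws, because the substring call kept only the high half of the pair.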

Received on Monday, 16 January 2012 20:55:15 UTC