Re: [css3-text] grapheme clusters across element boundary from Glenn Adams on 2012-01-16 (www-style@w3.org from January 2012)

From: Glenn Adams <glenn@skynav.com>
Date: Mon, 16 Jan 2012 13:06:06 -0700
To: Boris Zbarsky <bzbarsky@mit.edu>
Cc: "Kang-Hao (Kenny) Lu" <kennyluck@csail.mit.edu>, WWW Style <www-style@w3.org>
Message-ID: <CACQ=j+dVp2DNcn9H2XWiGRZodXgnmNu_v88GMPDpzeufddGiYA@mail.gmail.com>
the problems you refer to appear (to me) to derive only from two scenarios:

(1) some implementation allowing a user to enter codepoints directly (as
opposed to entire characters), which could possibly result in isolated
surrogate codepoints;

(2) script that naively assumes codepoint = character, and inadvertently
separates surrogate pair elements;
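
as a rough illustration of scenario (2), here is a minimal JavaScript sketch; the function names naiveWrap and pairAwareWrap and the sample string are hypothetical, chosen only for this example, and are not taken from any implementation discussed in this thread:

```javascript
// U+1D11E MUSICAL SYMBOL G CLEF is stored as the surrogate pair D834 DD1E.
const text = "a\u{1D11E}b";

// Naive: treats every UTF-16 code unit as a character, so the pair is split.
function naiveWrap(s) {
  return s.split("").map((unit) => "<span>" + unit + "</span>");
}

// Pair-aware: the string iterator keeps a well-formed surrogate pair together.
function pairAwareWrap(s) {
  return Array.from(s).map((ch) => "<span>" + ch + "</span>");
}

const naive = naiveWrap(text);         // two pieces hold isolated surrogates
const pairAware = pairAwareWrap(text); // the pair stays in one piece
```

note that the pair-aware version only keeps surrogate pairs intact; it makes no attempt to keep a whole grapheme cluster (e.g., a base character plus combining marks) together.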

regarding (1), i'm not sure this is a legitimate feature, or if legitimate,
a wise one, for an implementation to provide to naive end users; it could
clearly result in security problems of various kinds;

regarding (2), my position is that the implementation should be
conservative and not liberal when allowing script to set certain property
values whose underlying semantics imply a well-formed UTF-16 string; so,
yes, were I implementing this, I would throw an exception when a script
attempts to set a DOMString typed property to a JS String that contains an
isolated surrogate codepoint; or at least I would do this by default, and
only depart from this default in certain circumscribed cases;
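
a minimal sketch of the conservative behavior i have in mind, assuming a hypothetical checker isWellFormedUTF16 and a hypothetical setter setTextStrict (neither is an existing DOM API, and the built-in SyntaxError merely stands in for whatever exception an implementation would actually raise):

```javascript
// Returns true when s contains no isolated (unpaired) surrogate code unit.
function isWellFormedUTF16(s) {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: must be followed by a low surrogate.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next < 0xdc00 || next > 0xdfff) return false;
      i++; // skip the low surrogate just validated
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Low surrogate with no preceding high surrogate.
      return false;
    }
  }
  return true;
}

// Hypothetical strict setter: rejects ill-formed values instead of storing them.
function setTextStrict(target, value) {
  if (!isWellFormedUTF16(value)) {
    throw new SyntaxError("value contains an isolated surrogate");
  }
  target.textContent = value;
}
```

with this sketch, setTextStrict(node, "\uD834") throws, while a string containing the complete pair is stored unchanged.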

this is my answer to your question "what the best way to limit damage from
the lack of understanding on script authors' part is"; i certainly would
not adopt an approach which silently allows (or even encourages) the
proliferation of isolated surrogate code points;

of course, this could be resolved, at least in theory, by upgrading JS/ES
to use unicode scalar values instead of UTF-16 codepoints, but we know this
won't happen, at least anytime soon (if ever);
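
for concreteness, a small JavaScript sketch of the mismatch as script sees it today; the sample string is an assumption chosen for illustration:

```javascript
const s = "\u{1D11E}"; // one character outside the BMP

const codeUnits = s.length;              // counts UTF-16 code units
const codePoints = Array.from(s).length; // counts code points via the iterator
const first = s.codePointAt(0);          // reads the whole pair as one value
const firstUnit = s.charCodeAt(0);       // reads only the high surrogate
```

the iterator and codePointAt still pass an isolated surrogate through unchanged, so they expose code points rather than true unicode scalar values.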

On Mon, Jan 16, 2012 at 12:33 PM, Boris Zbarsky <bzbarsky@mit.edu> wrote:

> On 1/16/12 2:23 PM, Glenn Adams wrote:
>
>> the problem is that in this case codepoint != character; for a script to
>> correctly work with a UTF-16 string that contains surrogate pairs, it
>> must ensure that the individual codepoints of a surrogate pair are not
>> separated, e.g., by wrapping each codepoint as if it were a distinct
>> character
>>
>
> You know that.  I know that.  Script authors have no clue, for the most
> part.
>
>
>> if i were implementing this, i would throw a SYNTAX_ERR exception or
>> equivalent when writing the textContent property if the new value were
>> not a well-formed UTF-16 string, i.e., one that contains no isolated
>> surrogate code points
>>
>
> And you'd throw exceptions from using splitText in between surrogates and
> so forth?  So the upshot would be that the script would always work when
> the author tested it, but fail on some user input?
>
> Why is that better for either users or authors?
>
>
>>    Would you really expect the browser to convert some of the
>>    codepoints to \ufffd when the script does that?  That seems like it
>>    would violate the principle of least surprise.  It would also mean
>>    that the script in question would work just fine in initial testing
>>    then break as soon as a user entered some non-BMP characters.
>>
>> assuming for a moment that the user enters a non-BMP character via an
>> input method that maps to a complete surrogate pair
>>
>
> Of course it does.  The user enters characters; how the browser stores
> those internally is the browser's business, but it better do it "right".
>
>
>> then the script would need to ensure it did not erringly separate or wrap
>> individual codepoints of the surrogate pair in such a way as to separate
>> them
>>
>
> But the script does nothing of the sort, in practice.  Now what?
>
>
>>    I think we owe it to users to make this case work as Gecko does.
>>
>> if the implementation allows users to set textContent to a single
>> surrogate codepoint, then I believe it is doing a disservice to the
>> users (and possibly introducing a security risk) by providing a way to
>> enter non-well-formed character content
>>
>
> The implementation allows users to type in text.  The implementation
> allows scripts written by people who don't have a good concept of
> non-ASCII, much less non-BMP, characters to operate on that text.  The only
> question is what the best way to limit damage from the lack of
> understanding on script authors' part is.
>
> -Boris
>
>
Received on Monday, 16 January 2012 20:06:59 UTC