Re: [css3-text] grapheme clusters across element boundary from Boris Zbarsky on 2012-01-16 (www-style@w3.org from January 2012)

From: Boris Zbarsky <bzbarsky@MIT.EDU>
Date: Mon, 16 Jan 2012 14:33:11 -0500
To: Glenn Adams <glenn@skynav.com>
CC: "Kang-Hao (Kenny) Lu" <kennyluck@csail.mit.edu>, WWW Style <www-style@w3.org>
Message-ID: <4F147B77.4030306@mit.edu>

On 1/16/12 2:23 PM, Glenn Adams wrote:
> the problem is that in this case codepoint != character; for a script to
> correctly work with a UTF-16 string that contains surrogate pairs, it
> must ensure that the individual codepoints of a surrogate pair are not
> separated, e.g., by wrapping each codepoint as if it were a distinct
> character

You know that.  I know that.  Script authors have no clue, for the most 
part.
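
To make that concrete, here is a minimal sketch of what such a script tends to do when it wraps each "character" separately. The function names and the sample string are hypothetical, made up for illustration; only the standard String and Array calls are real.

```javascript
// Hypothetical helper: the naive approach, one chunk per UTF-16 code unit.
function splitByCodeUnit(text) {
  return text.split("");
}

// Hypothetical helper: one chunk per code point. The string iterator used
// by Array.from keeps surrogate pairs together. Note that this is still
// not grapheme-cluster segmentation; combining marks end up in their own
// chunks.
function splitByCodePoint(text) {
  return Array.from(text);
}

// "a", U+1D11E MUSICAL SYMBOL G CLEF (a non-BMP character), "b"
const sample = "a\u{1D11E}b";

console.log(splitByCodeUnit(sample).length);  // 4: the clef is torn in two
console.log(splitByCodePoint(sample).length); // 3: the clef stays whole
```

With BMP-only input the two functions return the same chunks, which is why the naive version passes the author's own testing.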

> if i were implementing this, i would throw a SYNTAX_ERR exception or
> equivalent when writing the textContent property if the new value were
> not a well-formed UTF-16 string, i.e., if it contained isolated
> surrogate code points

And you'd throw exceptions from using splitText in between surrogates 
and so forth?  So the upshot would be that the script would always work 
when the author tested it, but fail on some user input?

Why is that better for either users or authors?
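
A rough sketch of that failure mode, under loud assumptions: strictSetText is a hypothetical stand-in for the proposed throwing behavior, not any browser's API, and plain objects stand in for DOM nodes so the example runs outside a browser.

```javascript
// Returns true if the string contains a surrogate code unit that is not
// part of a well-formed high+low pair.
function hasLoneSurrogate(s) {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      const next = s.charCodeAt(i + 1); // NaN past the end of the string
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // well-formed pair, skip the low surrogate
        continue;
      }
      return true; // high surrogate without a following low surrogate
    }
    if (unit >= 0xDC00 && unit <= 0xDFFF) {
      return true; // low surrogate without a preceding high surrogate
    }
  }
  return false;
}

// Hypothetical strict setter modelling the proposal: refuse ill-formed text.
function strictSetText(node, value) {
  if (hasLoneSurrogate(value)) {
    throw new SyntaxError("ill-formed UTF-16");
  }
  node.textContent = value;
}

// What a typical script does: put the first "character" in one node and
// the rest in another, indexing by code unit.
function naiveSplit(text) {
  const head = {};
  const tail = {};
  strictSetText(head, text.slice(0, 1));
  strictSetText(tail, text.slice(1));
  return [head.textContent, tail.textContent];
}

console.log(naiveSplit("hello")); // works in the author's testing

try {
  naiveSplit("\u{1D11E}x"); // a user types a non-BMP character first
} catch (e) {
  console.log(e.name); // SyntaxError
}
```

The script is unchanged between the two calls; only the user's input differs.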

>     Would you really expect the browser to convert some of the
>     codepoints to \ufffd when the script does that?  That seems like it
>     would violate the principle of least surprise.  It would also mean
>     that the script in question would work just fine in initial testing
>     then break as soon as a user entered some non-BMP characters.
>
> assuming for a moment that the user enters a non-BMP character via a
> input method that maps to a complete surrogate pair

Of course it does.  The user enters characters; how the browser stores 
those internally is the browser's business, but it better do it "right".

> then the script would need to ensure it did not erroneously wrap the
> individual codepoints of the surrogate pair in a way that separates
> them

But the script does nothing of the sort, in practice.  Now what?

>     I think we owe it to users to make this case work as Gecko does.
>
> if the implementation allows users to set textContent to a single
> surrogate codepoint, then I believe it is doing a disservice to the
> users (and possibly introducing a security risk) by providing a way to
> enter ill-formed character content

The implementation allows users to type in text.  The implementation 
allows scripts written by people who don't have a good concept of 
non-ASCII, much less non-BMP, characters to operate on that text.  The 
only question is how best to limit the damage from script authors' lack 
of understanding.
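
For comparison, a small sketch of two ways an implementation could treat the halves of a pair that a script has split. Both are simplified models written for this example, not descriptions of what any particular engine does, and replaceLoneSurrogates is a hypothetical helper.

```javascript
// Hypothetical model of the "convert to \ufffd" option: every surrogate
// code unit that is not part of a well-formed pair becomes U+FFFD.
function replaceLoneSurrogates(s) {
  let out = "";
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    const isHigh = unit >= 0xD800 && unit <= 0xDBFF;
    const isLow = unit >= 0xDC00 && unit <= 0xDFFF;
    if (isHigh) {
      const next = s.charCodeAt(i + 1); // NaN past the end of the string
      if (next >= 0xDC00 && next <= 0xDFFF) {
        out += s[i] + s[i + 1]; // keep the well-formed pair
        i++;
      } else {
        out += "\uFFFD";
      }
    } else if (isLow) {
      out += "\uFFFD";
    } else {
      out += s[i];
    }
  }
  return out;
}

const original = "\u{1F600}"; // one non-BMP character, two code units
const halves = [original.slice(0, 1), original.slice(1)]; // naive split

// Model 1: store the code units untouched; rejoining recovers the text.
const preserved = halves.join("");

// Model 2: sanitize each half on the way in; the character is lost.
const replaced = halves.map(replaceLoneSurrogates).join("");

console.log(preserved === original);    // true
console.log(replaced === "\uFFFD\uFFFD"); // true
```

In the first model the user's character survives the script's mistake; in the second it is gone for good.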

-Boris

Received on Monday, 16 January 2012 19:33:47 UTC