Re: [css3-text] grapheme clusters across element boundary from Glenn Adams on 2012-01-16 (www-style@w3.org from January 2012)

From: Glenn Adams <glenn@skynav.com>
Date: Mon, 16 Jan 2012 12:23:15 -0700
To: Boris Zbarsky <bzbarsky@mit.edu>
Cc: "Kang-Hao (Kenny) Lu" <kennyluck@csail.mit.edu>, WWW Style <www-style@w3.org>
Message-ID: <CACQ=j+eUe9RQU0LR574uyhEWQVHZ81BPDbE+u1=CyZgbam1pSw@mail.gmail.com>

On Mon, Jan 16, 2012 at 10:02 AM, Boris Zbarsky <bzbarsky@mit.edu> wrote:

> On 1/16/12 10:10 AM, Glenn Adams wrote:
>
>> it is certainly questionable authoring for a surrogate pair to be
>> separated by an element boundary; if it appeared in actual input text, it
>> would certainly not be well-formed UTF-16; i would prefer a browser to
>> translate (or interpret) each member of the pair in the following
>> example to (as) the replacement character (\ufffd).
>>
>
> Consider the simple case HTML that has some string containing non-BMP
> characters and then a script that takes the text and wraps each "character"
> (which from the point of view of JS means each codepoint) in a <span>.
>  These are not that uncommon, by the way.  Oh, and they commonly run on
> user input, not on data provided by the site itself.
>

the problem is that in this case codepoint != character; for a script to
work correctly with a UTF-16 string that contains surrogate pairs, it must
ensure that the two codepoints of a surrogate pair are never separated,
which is exactly what happens when it wraps each codepoint as if it were a
distinct character
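
a minimal sketch of the difference, assuming nothing about the actual script
under discussion; splitByCodeUnit, splitByCodePoint and wrapInSpans are
hypothetical names made up for illustration, and the wrapper does no HTML
escaping:

```typescript
// Hypothetical helpers for illustration only; not part of any library.

/** Splits on UTF-16 code units, so a surrogate pair is torn in two. */
function splitByCodeUnit(text: string): string[] {
  return text.split("");
}

/** Splits so that a high surrogate followed by a low surrogate stays together. */
function splitByCodePoint(text: string): string[] {
  const pieces: string[] = [];
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    const isHigh = unit >= 0xd800 && unit <= 0xdbff;
    const next = i + 1 < text.length ? text.charCodeAt(i + 1) : 0;
    const nextIsLow = next >= 0xdc00 && next <= 0xdfff;
    if (isHigh && nextIsLow) {
      pieces.push(text.substring(i, i + 2)); // keep the pair intact
      i++; // skip the low surrogate already consumed
    } else {
      pieces.push(text.charAt(i));
    }
  }
  return pieces;
}

/** Wraps each piece in a <span> (sketch only: no HTML escaping). */
function wrapInSpans(pieces: string[]): string {
  return pieces.map((p) => `<span>${p}</span>`).join("");
}

// "\uD834\uDD1E" is U+1D11E MUSICAL SYMBOL G CLEF, a non-BMP character.
const sample = "a\uD834\uDD1Eb";
const units = splitByCodeUnit(sample);   // 4 pieces, two of them lone surrogates
const points = splitByCodePoint(sample); // 3 pieces, the pair kept together
```

wrapping `units` puts an element boundary inside the surrogate pair, while
wrapping `points` does not.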

by analogy, this is the same as if a UTF-8 encoded byte sequence were
separated into individual bytes, wrapping each byte as if it were a
distinct character; both behaviors are incorrect

if i were implementing this, i would throw a SYNTAX_ERR exception or
equivalent when writing the textContent property if the new value were not
a well-formed UTF-16 string, i.e., if it contained any isolated surrogate
code points


>
> Would you really expect the browser to convert some of the codepoints to
> \ufffd when the script does that?  That seems like it would violate the
> principle of least surprise.  It would also mean that the script in
> question would work just fine in initial testing then break as soon as a
> user entered some non-BMP characters.
>

assuming for a moment that the user enters a non-BMP character via an input
method that maps to a complete surrogate pair (which would be required if
it were in fact a defined non-BMP character), then the script would need to
ensure that it did not erroneously separate the codepoints of the surrogate
pair, e.g., by wrapping each one individually


>
> I think we owe it to users to make this case work as Gecko does.


if the implementation allows users to set textContent to a single surrogate
codepoint, then I believe it is doing a disservice to the users (and
possibly introducing a security risk) by providing a way to enter
ill-formed character content

Received on Monday, 16 January 2012 19:24:06 UTC