Re: [css3-text] grapheme clusters across element boundary from Boris Zbarsky on 2012-01-16 (www-style@w3.org from January 2012)

From: Boris Zbarsky <bzbarsky@MIT.EDU>
Date: Mon, 16 Jan 2012 12:02:46 -0500
To: Glenn Adams <glenn@skynav.com>
CC: "Kang-Hao (Kenny) Lu" <kennyluck@csail.mit.edu>, WWW Style <www-style@w3.org>
Message-ID: <4F145836.9050000@mit.edu>

On 1/16/12 10:10 AM, Glenn Adams wrote:
> it is certainly questionable authoring for a surrogate pair to be
> separated by a element boundary; if it appeared in actual input text, it
> would certainly not be well-formed UTF-16; i would prefer a browser to
> translate (or interpret) each member of the pair in the following
> example to (as) the replacement character (\ufffd).

Consider the simple case of HTML that has some string containing non-BMP
characters and then a script that takes the text and wraps each
"character" (which from the point of view of JS means each UTF-16 code
unit, not each code point) in a <span>.  Such scripts are not that
uncommon, by the way.  Oh, and they commonly run on user input, not on
data provided by the site itself.
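
As a rough sketch of the kind of script I mean (the function names are
made up for illustration, and a real script would build DOM nodes and
escape the text rather than concatenate strings), indexing a JS string
walks UTF-16 code units, so a non-BMP character ends up as two spans,
each holding one half of the surrogate pair:

```javascript
// Hypothetical helper: wraps each UTF-16 code unit in its own <span>,
// which is what a naive "one span per character" script ends up doing.
function wrapCodeUnits(text) {
  const spans = [];
  for (let i = 0; i < text.length; i++) {
    // charAt(i) returns a single code unit, possibly a lone surrogate.
    spans.push("<span>" + text.charAt(i) + "</span>");
  }
  return spans;
}

// Hypothetical alternative: the string iterator used by Array.from
// steps by code point, so a surrogate pair stays in one span.
function wrapCodePoints(text) {
  return Array.from(text, (ch) => "<span>" + ch + "</span>");
}

const input = "a\u{1D11E}b"; // U+1D11E is outside the BMP
console.log(wrapCodeUnits(input).length);  // 4 spans: the pair is split
console.log(wrapCodePoints(input).length); // 3 spans: the pair is kept
```

With BMP-only input the two functions give the same result, which is
why the first one tends to pass initial testing.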

Would you really expect the browser to convert those code units to
\ufffd when the script does that?  That seems like it would violate the
principle of least surprise.  It would also mean that the script in
question would work just fine in initial testing and then break as soon
as a user entered some non-BMP characters.

I think we owe it to users to make this case work as Gecko does.

-Boris

Received on Monday, 16 January 2012 17:03:19 UTC