- From: Boris Zbarsky <bzbarsky@MIT.EDU>
- Date: Mon, 16 Jan 2012 14:33:11 -0500
- To: Glenn Adams <glenn@skynav.com>
- CC: "Kang-Hao (Kenny) Lu" <kennyluck@csail.mit.edu>, WWW Style <www-style@w3.org>
On 1/16/12 2:23 PM, Glenn Adams wrote:
> the problem is that in this case codepoint != character; for a script to
> correctly work with a UTF-16 string that contains surrogate pairs, it
> must ensure that the individual codepoints of a surrogate pair are not
> separated, e.g., by wrapping each codepoint as if it were a distinct
> character

You know that.  I know that.  Script authors have no clue, for the most
part.

> if i were implementing this, i would throw a SYNTAX_ERR exception or
> equivalent when writing the textContent property if the new value were
> not a well-formed UTF-16 string, i.e., contains no isolated surrogate
> code points

And you'd throw exceptions from using splitText in between surrogates
and so forth?  So the upshot would be that the script would always work
when the author tested it, but fail on some user input?  Why is that
better for either users or authors?

> Would you really expect the browser to convert some of the
> codepoints to \ufffd when the script does that? That seems like it
> would violate the principle of least surprise. It would also mean
> that the script in question would work just fine in initial testing
> then break as soon as a user entered some non-BMP characters.
>
> assuming for a moment that the user enters a non-BMP character via a
> input method that maps to a complete surrogate pair

Of course it does.  The user enters characters; how the browser stores
those internally is the browser's business, but it better do it "right".

> then the script would need to ensure it did not erringly separate or wrap
> individual codepoints of the surrogate pair in such a way as to separate
> them

But the script does nothing of the sort, in practice.  Now what?

> I think we owe it to users to make this case work as Gecko does.
>
> if the implementation allows users to set textContent to a single
> surrogate codepoint, then I believe it is doing a dis-service to the
> users (and possibly introducing a security risk) by providing a way to
> enter non-well formed character content

The implementation allows users to type in text.  The implementation
allows scripts written by people who don't have a good concept of
non-ASCII, much less non-BMP, characters to operate on that text.  The
only question is what the best way to limit damage from the lack of
understanding on script authors' part is.

-Boris
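
For readers following the thread, the failure mode under discussion can be sketched in a few lines. This is only an illustration, not what Gecko or any other browser does: plain string slicing stands in for splitText, and the helper names (isHighSurrogate, isLowSurrogate, isWellFormedUtf16, safeSplitOffset) are hypothetical and do not come from the thread or from any DOM API. It shows a split at a code-unit offset leaving isolated surrogates, and the sort of check a script author would have to remember to write to avoid it.

```typescript
// Hypothetical helpers for illustration only; not part of any DOM API.

// True if the UTF-16 code unit is a high (lead) surrogate.
function isHighSurrogate(unit: number): boolean {
  return unit >= 0xd800 && unit <= 0xdbff;
}

// True if the UTF-16 code unit is a low (trail) surrogate.
function isLowSurrogate(unit: number): boolean {
  return unit >= 0xdc00 && unit <= 0xdfff;
}

// True if the string contains no isolated surrogate code units.
function isWellFormedUtf16(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (isHighSurrogate(unit)) {
      // A high surrogate must be followed immediately by a low surrogate.
      if (i + 1 >= s.length || !isLowSurrogate(s.charCodeAt(i + 1))) {
        return false;
      }
      i++; // skip the low half of the pair
    } else if (isLowSurrogate(unit)) {
      return false; // low surrogate with no preceding high surrogate
    }
  }
  return true;
}

// Move a split offset back by one code unit if it would fall between
// the two halves of a surrogate pair; otherwise return it unchanged.
function safeSplitOffset(s: string, offset: number): number {
  if (
    offset > 0 &&
    offset < s.length &&
    isHighSurrogate(s.charCodeAt(offset - 1)) &&
    isLowSurrogate(s.charCodeAt(offset))
  ) {
    return offset - 1;
  }
  return offset;
}

// "a", then U+1D11E (a non-BMP character stored as the pair D834 DD1E), then "b".
const text = "a\uD834\uDD1Eb";

// Naive split at code-unit offset 2 lands inside the surrogate pair.
const naiveHead = text.slice(0, 2);
const naiveTail = text.slice(2);

// Adjusted split keeps the pair together.
const cut = safeSplitOffset(text, 2);
const safeHead = text.slice(0, cut);
const safeTail = text.slice(cut);
```

The naive split works for any BMP-only test input and only produces ill-formed halves once a non-BMP character appears, which is the "works in testing, breaks on user input" situation described above.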
Received on Monday, 16 January 2012 19:33:47 UTC