- From: Glenn Adams <glenn@skynav.com>
- Date: Mon, 16 Jan 2012 12:23:15 -0700
- To: Boris Zbarsky <bzbarsky@mit.edu>
- Cc: "Kang-Hao (Kenny) Lu" <kennyluck@csail.mit.edu>, WWW Style <www-style@w3.org>
- Message-ID: <CACQ=j+eUe9RQU0LR574uyhEWQVHZ81BPDbE+u1=CyZgbam1pSw@mail.gmail.com>
On Mon, Jan 16, 2012 at 10:02 AM, Boris Zbarsky <bzbarsky@mit.edu> wrote:

> On 1/16/12 10:10 AM, Glenn Adams wrote:
>
>> it is certainly questionable authoring for a surrogate pair to be
>> separated by a element boundary; if it appeared in actual input text, it
>> would certainly not be well-formed UTF-16; i would prefer a browser to
>> translate (or interpret) each member of the pair in the following
>> example to (as) the replacement character (\ufffd).
>
> Consider the simple case HTML that has some string containing non-BMP
> characters and then a script that takes the text and wraps each "character"
> (which from the point of view of JS means each codepoint) in a <span>.
> These are not that uncommon, by the way. Oh, and they commonly run on
> user input, not on data provided by the site itself.

The problem is that in this case codepoint != character. For a script to work correctly with a UTF-16 string that contains surrogate pairs, it must ensure that the two codepoints of a surrogate pair are never separated, e.g., by wrapping each codepoint as if it were a distinct character.

By analogy, this is the same as splitting a UTF-8 encoded byte sequence into individual bytes and wrapping each byte as if it were a distinct character. Both behaviors are incorrect.

If I were implementing this, I would throw a SYNTAX_ERR exception or equivalent when writing the textContent property if the new value were not a well-formed UTF-16 string, i.e., one that contains no isolated surrogate code points.

> Would you really expect the browser to convert some of the codepoints to
> \ufffd when the script does that? That seems like it would violate the
> principle of least surprise. It would also mean that the script in
> question would work just fine in initial testing then break as soon as a
> user entered some non-BMP characters.

Assuming for a moment that the user enters a non-BMP character via an input method that maps to a complete surrogate pair (which would be required if it were in fact a defined non-BMP character), the script would need to ensure that it did not erroneously wrap the individual codepoints of the surrogate pair in a way that separates them.

> I think we owe it to users to make this case work as Gecko does.

If the implementation allows users to set textContent to a single surrogate codepoint, then I believe it is doing a disservice to users (and possibly introducing a security risk) by providing a way to enter ill-formed character content.
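
To make the distinction concrete, here is a minimal sketch of what a script would have to do to keep surrogate pairs intact, plus the kind of well-formedness check described above. The function names (isWellFormedUTF16, splitKeepingSurrogatePairs, checkedTextValue) are hypothetical, chosen only for illustration. The sketch assumes a modern ECMAScript engine in which Array.from iterates a string by whole code point, and it uses the built-in SyntaxError merely as a stand-in for a DOM SYNTAX_ERR exception. It does not describe what any browser actually does.

```javascript
// Hypothetical helper: true if the string contains no isolated
// (unpaired) surrogate, i.e. it is well-formed UTF-16.
function isWellFormedUTF16(s) {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // A high surrogate must be immediately followed by a low surrogate.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next < 0xDC00 || next > 0xDFFF) {
        return false;
      }
      i++; // skip the low surrogate that completes the pair
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // A low surrogate with no preceding high surrogate is isolated.
      return false;
    }
  }
  return true;
}

// Hypothetical helper: split text so that each surrogate pair stays
// together. Array.from uses the string iterator, which yields one
// element per code point rather than one per 16-bit unit.
function splitKeepingSurrogatePairs(s) {
  return Array.from(s);
}

// Hypothetical stand-in for a checked textContent setter: reject
// ill-formed input instead of storing it. SyntaxError is used here
// only as an analogue of SYNTAX_ERR.
function checkedTextValue(s) {
  if (!isWellFormedUTF16(s)) {
    throw new SyntaxError("value is not well-formed UTF-16");
  }
  return s;
}
```

A script that wraps each element returned by splitKeepingSurrogatePairs in its own <span> never places an element boundary inside a surrogate pair, whereas one that indexes the string one 16-bit unit at a time does.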
Received on Monday, 16 January 2012 19:24:06 UTC