- From: Glenn Adams <glenn@skynav.com>
- Date: Mon, 16 Jan 2012 13:06:06 -0700
- To: Boris Zbarsky <bzbarsky@mit.edu>
- Cc: "Kang-Hao (Kenny) Lu" <kennyluck@csail.mit.edu>, WWW Style <www-style@w3.org>
- Message-ID: <CACQ=j+dVp2DNcn9H2XWiGRZodXgnmNu_v88GMPDpzeufddGiYA@mail.gmail.com>
the problems you refer to appear (to me) to derive only from two scenarios:

(1) some implementation allowing a user to enter codepoints directly (as opposed to entire characters), which could possibly result in isolated surrogate codepoints;

(2) script that naively assumes codepoint = character, and inadvertently separates surrogate pair elements;

regarding (1), i'm not sure this is a legitimate, or if legitimate, then wise feature for an implementation to provide to naive end users; it could clearly result in security problems of various kinds;

regarding (2), my position is that the implementation should be conservative and not liberal when allowing script to set certain property values whose underlying semantics imply a well-formed UTF-16 string; so, yes, were I implementing this, I would throw an exception when a script attempts to set a DOMString typed property to a JS String that contains an isolated surrogate codepoint; or at least I would do this by default, and only depart from this default in certain circumscribed cases;

this is my answer to your question "what the best way to limit damage from the lack of understanding on script authors' part is"; i certainly would not adopt an approach which silently allows (or even encourages) the proliferation of isolated surrogate code points;

of course, this could be resolved, at least in theory, by upgrading JS/ES to use unicode scalar values instead of UTF-16 codepoints, but we know this won't happen, at least anytime soon (if ever);

On Mon, Jan 16, 2012 at 12:33 PM, Boris Zbarsky <bzbarsky@mit.edu> wrote:

> On 1/16/12 2:23 PM, Glenn Adams wrote:
>
>> the problem is that in this case codepoint != character; for a script to
>> correctly work with a UTF-16 string that contains surrogate pairs, it
>> must ensure that the individual codepoints of a surrogate pair are not
>> separated, e.g., by wrapping each codepoint as if it were a distinct
>> character
>
> You know that. I know that. Script authors have no clue, for the most
> part.
>
>> if i were implementing this, i would throw a SYNTAX_ERR exception or
>> equivalent when writing the textContent property if the new value were
>> not a well-formed UTF-16 string, i.e., contains no isolated surrogate
>> code points
>
> And you'd throw exceptions from using splitText in between surrogates and
> so forth? So the upshot would be that the script would always work when
> the author tested it, but fail on some user input?
>
> Why is that better for either users or authors?
>
>> Would you really expect the browser to convert some of the
>> codepoints to \ufffd when the script does that? That seems like it
>> would violate the principle of least surprise. It would also mean
>> that the script in question would work just fine in initial testing
>> then break as soon as a user entered some non-BMP characters.
>>
>> assuming for a moment that the user enters a non-BMP character via a
>> input method that maps to a complete surrogate pair
>
> Of course it does. The user enters characters; how the browser stores
> those internally is the browser's business, but it better do it "right".
>
>> then the script would need to ensure it did not erringly separate or wrap
>> individual codepoints of the surrogate pair in such a way as to separate
>> them
>
> But the script does nothing of the sort, in practice. Now what?
>
>> I think we owe it to users to make this case work as Gecko does.
>>
>> if the implementation allows users to set textContent to a single
>> surrogate codepoint, then I believe it is doing a dis-service to the
>> users (and possibly introducing a security risk) by providing a way to
>> enter non-well formed character content
>
> The implementation allows users to type in text. The implementation
> allows scripts written by people who don't have a good concept of
> non-ASCII, much less non-BMP, characters to operate on that text. The only
> question is what the best way to limit damage from the lack of
> understanding on script authors' part is.
>
> -Boris
Received on Monday, 16 January 2012 20:06:59 UTC
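
The two policies debated in this thread, throwing on an isolated surrogate versus substituting \ufffd, can be sketched in JavaScript as below. This is only an illustration under stated assumptions, not a description of what any browser does: the function names findIsolatedSurrogate, assertWellFormed, and replaceIsolatedSurrogates are hypothetical, and SyntaxError merely stands in for the SYNTAX_ERR exception mentioned in the message.

```javascript
// Hypothetical helpers; none of these names belong to the DOM or ECMAScript.

// Returns the index of the first isolated surrogate code unit in s,
// or -1 if s is a well-formed UTF-16 string.
function findIsolatedSurrogate(s) {
  for (let i = 0; i < s.length; i++) {
    const u = s.charCodeAt(i);
    if (u >= 0xD800 && u <= 0xDBFF) {
      // High surrogate: must be immediately followed by a low surrogate.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low half of a valid pair
      } else {
        return i;
      }
    } else if (u >= 0xDC00 && u <= 0xDFFF) {
      return i; // low surrogate with no preceding high surrogate
    }
  }
  return -1;
}

// Strict policy: refuse ill-formed input by throwing.
function assertWellFormed(s) {
  const i = findIsolatedSurrogate(s);
  if (i !== -1) {
    throw new SyntaxError("isolated surrogate at code unit " + i);
  }
  return s;
}

// Lenient policy: replace each isolated surrogate with U+FFFD.
function replaceIsolatedSurrogates(s) {
  let out = "";
  for (let i = 0; i < s.length; i++) {
    const u = s.charCodeAt(i);
    const isHigh = u >= 0xD800 && u <= 0xDBFF;
    const isLow = u >= 0xDC00 && u <= 0xDFFF;
    if (isHigh && i + 1 < s.length) {
      const next = s.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        out += s[i] + s[i + 1]; // keep a valid pair intact
        i++;
        continue;
      }
    }
    out += (isHigh || isLow) ? "\uFFFD" : s[i];
  }
  return out;
}

// A non-BMP character (U+1D538) occupies two code units, so a naive
// slice on code-unit indices can cut the pair in half.
const text = "a\uD835\uDD38b";
const naiveSlice = text.slice(0, 2); // ends with an isolated high surrogate
```

Here text passes assertWellFormed, while naiveSlice makes it throw; this is the "works in testing, fails on some user input" behaviour raised in the quoted reply. Recent JavaScript engines also ship String.prototype.isWellFormed and String.prototype.toWellFormed, which cover the same check and replacement.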