- From: Jonathan Kew <jfkthame@googlemail.com>
- Date: Fri, 13 Sep 2013 11:33:05 +0100
- To: Anne van Kesteren <annevk@annevk.nl>
- CC: John Daggett <jdaggett@mozilla.com>, Addison Phillips <addison@lab126.com>, Richard Ishida <ishida@w3.org>, W3C Style <www-style@w3.org>, www International <www-international@w3.org>
On 13/9/13 10:01, Anne van Kesteren wrote:
> On Fri, Sep 13, 2013 at 9:46 AM, Jonathan Kew <jfkthame@googlemail.com> wrote:
>> Given that it's possible for script to insert lone surrogates into the DOM,
>> I think we have to "render" them in some way - though simply rendering a
>> hexbox, a "broken character" graphic, or perhaps U+FFFD, would be
>> sufficient; there's no need to even attempt font matching as though they
>> were actual characters.
>
> So Gecko doesn't do U+FFFD, but I would prefer it if we did. In
> particular, if the rendering subsystem would only operate on Unicode
> scalar values and treat lone surrogates as errors, I think that'd be
> an improvement.

This is a tricky issue, IMO. What would it mean for the rendering subsystem to "treat lone surrogates as errors", exactly? We don't want the presence of a lone surrogate to cause the rendering system to bail and refuse to render the remainder of the text run, for example. Nor do we want the lone surrogate to be completely ignored; its presence in the data is liable to interfere with other processes, so it's useful for the user to be able to see that there's *something* there.

Rendering as U+FFFD might be an option, but IMO rendering as a hexbox is actually better. Note that because JS can manipulate the text in terms of UTF-16 code units (NOT characters), it is possible for it to "reassemble" two separate isolated surrogates into a single valid Unicode character; so the fact that the isolated surrogates still retain their distinct identities, rather than all appearing as U+FFFD glyphs, makes it easier to understand what is happening in such cases. If all isolated surrogates are rendered indistinguishably, then the behavior whereby bringing two of them into contact "magically" produces a valid Unicode character - but which particular one is impossible to predict from what was displayed - seems far more mysterious.
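
To make the recombination case above concrete, here is a minimal sketch in TypeScript. The helper names (`isLoneSurrogate`, `hexboxLabel`, `describeForDisplay`) are hypothetical and purely illustrative; they are not taken from Gecko or any other engine, and the bracketed hex label merely stands in for a hexbox glyph.

```typescript
// Hypothetical helpers, for illustration only (not any engine's real code).

// True if the UTF-16 code unit at index i is a surrogate with no valid partner.
function isLoneSurrogate(s: string, i: number): boolean {
  const u = s.charCodeAt(i);
  if (u >= 0xd800 && u <= 0xdbff) {
    // High surrogate: lone unless immediately followed by a low surrogate.
    const next = s.charCodeAt(i + 1); // NaN past the end of the string
    return !(next >= 0xdc00 && next <= 0xdfff);
  }
  if (u >= 0xdc00 && u <= 0xdfff) {
    // Low surrogate: lone unless immediately preceded by a high surrogate.
    const prev = s.charCodeAt(i - 1); // NaN before the start of the string
    return !(prev >= 0xd800 && prev <= 0xdbff);
  }
  return false;
}

// Text stand-in for a hexbox: the code unit's value in uppercase hex.
function hexboxLabel(unit: number): string {
  return "[" + unit.toString(16).toUpperCase() + "]";
}

// One display item per character; lone surrogates keep their identity.
function describeForDisplay(s: string): string[] {
  const out: string[] = [];
  for (let i = 0; i < s.length; i++) {
    const u = s.charCodeAt(i);
    if (isLoneSurrogate(s, i)) {
      out.push(hexboxLabel(u));
    } else if (u >= 0xd800 && u <= 0xdbff) {
      out.push(s.slice(i, i + 2)); // valid pair: one character, two code units
      i++;
    } else {
      out.push(s.charAt(i));
    }
  }
  return out;
}

// Two isolated surrogates separated by an ordinary letter.
const separated = "\uD83D" + "x" + "\uDE00";
const shown = describeForDisplay(separated); // ["[D83D]", "x", "[DE00]"]

// Script brings the two code units into contact: they now form U+1F600.
const rejoined = separated.charAt(0) + separated.charAt(2);
const shownAfter = describeForDisplay(rejoined); // a single character
```

With hexbox-style labels, the display before recombination shows D83D and DE00, from which the resulting U+1F600 can be worked out. Had both been drawn as U+FFFD, the two items would have looked identical and the result could not have been predicted.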

Wherever possible, we should try to prevent isolated surrogates entering the data in the first place - e.g., replacing them with U+FFFD when parsing a Unicode text stream seems entirely appropriate. However, as long as JS exposes the UTF-16 encoding form, and allows arbitrary manipulation of the code units, the rendering system needs to handle them somehow, and IMO rendering them as hexboxes that indicate the actual code units present is more useful than rendering them all with a generic "error" indicator such as U+FFFD, which hides a potentially important clue as to what may happen if they are recombined. A "real" U+FFFD character in the text will never interact with another U+FFFD to produce a new Unicode letter, whereas two isolated surrogates may well do so; they should not appear the same.

However, all this is straying rather far from the specific issue of unicode-range, for which I suggest that surrogate codepoints are simply irrelevant, as they should not go through font-matching as individual codepoints at all.

JK

Received on Friday, 13 September 2013 10:33:40 UTC