- From: John C Klensin <john+w3c@jck.com>
- Date: Fri, 13 Sep 2013 07:27:06 -0400
- To: Jonathan Kew <jfkthame@googlemail.com>, Anne van Kesteren <annevk@annevk.nl>
- cc: John Daggett <jdaggett@mozilla.com>, Addison Phillips <addison@lab126.com>, Richard Ishida <ishida@w3.org>, W3C Style <www-style@w3.org>, www International <www-international@w3.org>
--On Friday, September 13, 2013 11:33 +0100 Jonathan Kew <jfkthame@googlemail.com> wrote:

> On 13/9/13 10:01, Anne van Kesteren wrote:
>> On Fri, Sep 13, 2013 at 9:46 AM, Jonathan Kew <jfkthame@googlemail.com> wrote:
>>> Given that it's possible for script to insert lone surrogates into the
>>> DOM, I think we have to "render" them in some way - though simply
>>> rendering a hexbox, a "broken character" graphic, or perhaps U+FFFD,
>>> would be sufficient; there's no need to even attempt font matching as
>>> though they were actual characters.
>>
>> So Gecko doesn't do U+FFFD, but I would prefer it if we did. In
>> particular, if the rendering subsystem would only operate on Unicode
>> scalar values and treat lone surrogates as errors, I think that'd be an
>> improvement.
>
> This is a tricky issue, IMO. What would it mean for the rendering
> subsystem to "treat lone surrogates as errors", exactly? We don't want
> the presence of a lone surrogate to cause the rendering system to bail
> and refuse to render the remainder of the text run, for example. Nor do
> we want the lone surrogate to be completely ignored; its presence in the
> data is liable to interfere with other processes, so it's useful for the
> user to be able to see that there's *something* there.
>
> Rendering as U+FFFD might be an option, but IMO rendering as a hexbox is
> actually better. Note that because JS can manipulate the text in terms
> of UTF-16 code units (NOT characters), it is possible for it to
> "reassemble" two separate isolated surrogates into a single valid
> Unicode character; so the fact that the isolated surrogates still retain
> their distinct identities, rather than all appearing as U+FFFD glyphs,
> makes it easier to understand what is happening in such cases. If all
> isolated surrogates are rendered indistinguishably, then the behavior
> whereby bringing two of them into contact "magically" produces a valid
> Unicode character - but which particular one is impossible to predict
> from what was displayed - seems far more mysterious.
>...
> However, all this is straying rather far from the specific issue of
> unicode-range, for which I suggest that surrogate codepoints are simply
> irrelevant, as they should not go through font-matching as individual
> codepoints at all.

Yes, but: while everyone who does a UTF-16 to code point conversion is supposed to know about surrogate decoding, and probably does, I'm less confident about UTF-32, where, I believe, conversions to code points tend to be naive. Surrogate introducers should never appear in UTF-32, and we shouldn't encounter them in UTF-8 either (nor, for that matter, should we encounter non-shortest-form UTF-8). The problem is what to do about them if they do show up and, as we discussed on the I18n call yesterday, whether hiding that question behind "valid" is helpful.

To generalize this a bit, there is the question of how to "render" an unassigned code point and whether, when strings delivered in some encoding are converted into code points rather than handled in their original form, the "rendering" of an unassigned code point should be the same as the display of surrogate introducers.
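To make the code-unit behavior Jonathan describes concrete, here is a minimal TypeScript sketch (illustrative only; the values and the helper name are my own, not drawn from any implementation under discussion). Strings are sequences of UTF-16 code units, so slicing can strand lone surrogates, and bringing the two halves back into contact restores the original character; the helper at the end is the standard surrogate-pair decoding mentioned above.

    // "😀" (U+1F600) is stored as the surrogate pair D83D DE00 in UTF-16.
    const emoji = "\u{1F600}";
    console.log(emoji.length);       // 2 -- length counts code units, not characters

    // Slicing by code units strands two lone surrogates.
    const hi = emoji.slice(0, 1);    // "\uD83D" (high/lead surrogate)
    const lo = emoji.slice(1, 2);    // "\uDE00" (low/trail surrogate)
    console.log(hi.codePointAt(0)!.toString(16));  // "d83d" -- not a scalar value

    // Bringing the two lone surrogates back into contact "reassembles" the
    // original character, which is why rendering them distinguishably (e.g.
    // as hexboxes) is easier to reason about than a uniform U+FFFD.
    const rejoined = hi + lo;
    console.log(rejoined === emoji); // true

    // Standard UTF-16 decoding: surrogate pair -> Unicode scalar value.
    function decodeSurrogatePair(lead: number, trail: number): number {
      return 0x10000 + ((lead - 0xd800) << 10) + (trail - 0xdc00);
    }
    console.log(decodeSurrogatePair(0xd83d, 0xde00).toString(16)); // "1f600"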
I suggest that the discussion and issues could be clarified by your inventing some terminology to better distinguish at least some of the following cases from each other:

* Invalid Unicode (e.g., U+11FFFF)

* Invalid encoding (e.g., surrogate introducers in UTF-8 or UTF-32)

* Unassigned code points (there is, by definition, no correct rendering for these, only the possibility of substituting an indicator rather than refusing to map the display into graphics at all)

* Rendering substitutions, including substituting near-match glyphs for the distinctive ones implied by the code point sequence, and substitution of alternate type styles with richer font sets in one or more glyph positions (or perhaps those two should be further subdivided)

* Strings that could be rendered the way the entity that constructed the string intended (i.e., all of the needed glyphs or graphemes are present in the font) but for which the application lacks (or might lack) sufficient rendering capability (e.g., half characters and position-dependent forms that are not distinguished in the encoding may cause this issue)

* Strings that can and will be rendered as the string-creator anticipated.

I'd like to see us get rid of "valid" entirely because, as we have seen, it is just too subject to different informal interpretations by different people. But the categories after the first two or three all have to do with whether a string can be rendered satisfactorily (and, obviously, with what "satisfactorily" means in context), not with "validity".

It is a somewhat different issue but, coming back to Jonathan's comment quoted above, I think one can make the case that a string that contains sufficiently invalid Unicode is an invalid string and that all of it should be shown as hexboxes or the equivalent, because the application cannot really know what was intended and telling the user that may be better than guessing. By contrast, if an individual code point sequence is completely understood but cannot be rendered as intended, substituting hexboxes or other warning symbols for it may be reasonable, even as part of a longer string.

best,
   john
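The first three of the categories above lend themselves to a per-code-point check; the later ones are properties of whole strings, fonts, and renderers rather than of code points, which is why they belong under "can this be rendered satisfactorily" rather than under "validity". A rough TypeScript sketch of such a check follows; the category names are illustrative placeholders, not proposed terminology, and the unassigned-code-point test has to be supplied from Unicode Character Database data, since no such lookup is built into the language.

    type CodePointCategory =
      | "invalid-unicode"   // not a code point at all, e.g. 0x11FFFF
      | "surrogate"         // D800..DFFF: never legal as a scalar value in UTF-8 or UTF-32
      | "unassigned"        // a legal scalar value with no character assigned
      | "assigned";

    // `isAssigned` stands in for a lookup against the Unicode Character
    // Database; callers must supply it (e.g. from generated tables).
    function categorize(
      cp: number,
      isAssigned: (cp: number) => boolean
    ): CodePointCategory {
      if (!Number.isInteger(cp) || cp < 0 || cp > 0x10ffff) return "invalid-unicode";
      if (cp >= 0xd800 && cp <= 0xdfff) return "surrogate";
      return isAssigned(cp) ? "assigned" : "unassigned";
    }

    // Example use with a deliberately naive stand-in for the UCD lookup.
    const naiveIsAssigned = (_cp: number) => true;
    console.log(categorize(0x11ffff, naiveIsAssigned)); // "invalid-unicode"
    console.log(categorize(0xd800, naiveIsAssigned));   // "surrogate"
    console.log(categorize(0x1f600, naiveIsAssigned));  // "assigned"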
Received on Friday, 13 September 2013 11:27:44 UTC