Re: [css-fonts-3] i18n-ISSUE-296: Usable characters in unicode-range from Jonathan Kew on 2013-09-13 (www-international@w3.org from July to September 2013)

From: Jonathan Kew <jfkthame@googlemail.com>
Date: Fri, 13 Sep 2013 11:33:05 +0100
To: Anne van Kesteren <annevk@annevk.nl>
CC: John Daggett <jdaggett@mozilla.com>, Addison Phillips <addison@lab126.com>, Richard Ishida <ishida@w3.org>, W3C Style <www-style@w3.org>, www International <www-international@w3.org>
Message-ID: <5232E9E1.2060205@gmail.com>

On 13/9/13 10:01, Anne van Kesteren wrote:
> On Fri, Sep 13, 2013 at 9:46 AM, Jonathan Kew <jfkthame@googlemail.com> wrote:
>> Given that it's possible for script to insert lone surrogates into the DOM,
>> I think we have to "render" them in some way - though simply rendering a
>> hexbox, a "broken character" graphic, or perhaps U+FFFD, would be
>> sufficient; there's no need to even attempt font matching as though they
>> were actual characters.
>
> So Gecko doesn't do U+FFFD, but I would prefer it if we did. In
> particular, if the rendering subsystem would only operate on Unicode
> scalar values and treat lone surrogates as errors, I think that'd be
> an improvement.

This is a tricky issue, IMO. What would it mean for the rendering 
subsystem to "treat lone surrogates as errors", exactly? We don't want 
the presence of a lone surrogate to cause the rendering system to bail 
and refuse to render the remainder of the text run, for example. Nor do 
we want the lone surrogate to be completely ignored; its presence in the 
data is liable to interfere with other processes, so it's useful for the 
user to be able to see that there's *something* there.

Rendering as U+FFFD might be an option, but IMO rendering as a hexbox is 
actually better. Note that because JS can manipulate the text in terms 
of UTF-16 code units (NOT characters), it is possible for it to 
"reassemble" two separate isolated surrogates into a single valid 
Unicode character; so the fact that the isolated surrogates still retain 
their distinct identities, rather than all appearing as U+FFFD glyphs, 
makes it easier to understand what is happening in such cases. If all 
isolated surrogates are rendered identically, then bringing two of them 
into contact "magically" produces a valid Unicode character, and nothing 
that was displayed beforehand lets the user predict which one; that 
seems far more mysterious.
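
As a rough illustration of the reassembly described above, here is a 
minimal TypeScript sketch. The helper names (hexboxLabel, 
combineSurrogates) and the sample code units are assumptions chosen for 
the example; they are not taken from Gecko or from any spec.

```typescript
// Two isolated surrogates held in separate strings, as script could create them.
const high: string = "\uD83D"; // lone high (lead) surrogate
const low: string = "\uDE00";  // lone low (trail) surrogate

// Hypothetical stand-in for a hexbox: the code unit's value as four hex digits.
function hexboxLabel(unit: string): string {
  const hex = unit.charCodeAt(0).toString(16).toUpperCase();
  return ("0000" + hex).slice(-4);
}

// Hypothetical helper: the scalar value encoded by a high/low surrogate pair.
function combineSurrogates(hi: number, lo: number): number {
  return 0x10000 + ((hi - 0xd800) << 10) + (lo - 0xdc00);
}

// Plain concatenation of UTF-16 code units yields one valid character.
const joined: string = high + low;
const scalar: number = combineSurrogates(joined.charCodeAt(0), joined.charCodeAt(1));

console.log(hexboxLabel(high), hexboxLabel(low)); // distinct labels per code unit
console.log("U+" + scalar.toString(16).toUpperCase());
```

With hexbox-style labels the two halves stay distinguishable before they 
are joined; if both were shown as U+FFFD, nothing on screen would hint at 
which character the concatenation produces.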

Wherever possible, we should try to prevent isolated surrogates entering 
the data in the first place - e.g., replacing them with U+FFFD when 
parsing a Unicode text stream seems entirely appropriate. However, as 
long as JS exposes the UTF-16 encoding form, and allows arbitrary 
manipulation of the code units, the rendering system needs to handle 
them somehow, and IMO rendering them as hexboxes that indicate the 
actual code units present is more useful than rendering them all with a 
generic "error" indicator such as U+FFFD, which hides a potentially 
important clue as to what may happen if they are recombined. A "real" 
U+FFFD character in the text will never interact with another U+FFFD to 
produce a new Unicode letter, whereas two isolated surrogates may well 
do so; they should not appear the same.
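
A minimal sketch of the "replace on input" step suggested above, again 
in TypeScript. The function name replaceLoneSurrogates is hypothetical, 
and the loop is only one straightforward way to do it, not a description 
of what any particular parser does.

```typescript
// Replace each unpaired surrogate code unit with U+FFFD, leaving valid
// high+low pairs (and all other code units) untouched.
function replaceLoneSurrogates(input: string): string {
  let out = "";
  for (let i = 0; i < input.length; i++) {
    const unit = input.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: valid only if a low surrogate follows immediately.
      const next = i + 1 < input.length ? input.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {
        out += input.charAt(i) + input.charAt(i + 1);
        i++; // skip the low half we just copied
      } else {
        out += "\uFFFD";
      }
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      // Low surrogate with no preceding high surrogate.
      out += "\uFFFD";
    } else {
      out += input.charAt(i);
    }
  }
  return out;
}

// Once replaced, the two halves can no longer recombine into a character.
const cleaned: string =
  replaceLoneSurrogates("\uD83D") + replaceLoneSurrogates("\uDE00");
console.log(cleaned === "\uFFFD\uFFFD");
```

This only helps at the boundary where text enters the system; strings 
built later by script from raw code units would bypass it, which is why 
the renderer still needs its own handling.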

However, all this is straying rather far from the specific issue of 
unicode-range, for which I suggest that surrogate codepoints are simply 
irrelevant, as they should not go through font-matching as individual 
codepoints at all.
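
To make that suggestion concrete, here is a small sketch of a range 
check that never lets a surrogate codepoint take part in font matching. 
The UnicodeRangeInterval type and the rangeCovers function are 
assumptions made for the example, not part of css-fonts-3 or of any 
browser's implementation.

```typescript
// Hypothetical representation of one unicode-range interval, e.g. U+0000-FFFF.
interface UnicodeRangeInterval {
  start: number;
  end: number;
}

function isSurrogateCodepoint(cp: number): boolean {
  return cp >= 0xd800 && cp <= 0xdfff;
}

// True when `cp` should be matched against a font carrying these ranges.
// Surrogate codepoints are excluded up front, even if an interval spans them.
function rangeCovers(ranges: UnicodeRangeInterval[], cp: number): boolean {
  if (isSurrogateCodepoint(cp)) {
    return false;
  }
  return ranges.some((r) => cp >= r.start && cp <= r.end);
}

const bmp: UnicodeRangeInterval[] = [{ start: 0x0000, end: 0xffff }];
console.log(rangeCovers(bmp, 0x0041)); // ordinary BMP character
console.log(rangeCovers(bmp, 0xd83d)); // surrogate codepoint inside the interval
```

Under this sketch a range written as U+0000-FFFF remains legal to 
author, but the surrogate block inside it has no effect on matching.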

JK

Received on Friday, 13 September 2013 10:33:38 UTC