- From: John C Klensin <john+w3c@jck.com>
- Date: Fri, 13 Sep 2013 07:27:06 -0400
- To: Jonathan Kew <jfkthame@googlemail.com>, Anne van Kesteren <annevk@annevk.nl>
- cc: John Daggett <jdaggett@mozilla.com>, Addison Phillips <addison@lab126.com>, Richard Ishida <ishida@w3.org>, W3C Style <www-style@w3.org>, www International <www-international@w3.org>
--On Friday, September 13, 2013 11:33 +0100 Jonathan Kew
<jfkthame@googlemail.com> wrote:
> On 13/9/13 10:01, Anne van Kesteren wrote:
>> On Fri, Sep 13, 2013 at 9:46 AM, Jonathan Kew
>> <jfkthame@googlemail.com> wrote:
>>> Given that it's possible for script to insert lone
>>> surrogates into the DOM, I think we have to "render" them in
>>> some way - though simply rendering a hexbox, a "broken
>>> character" graphic, or perhaps U+FFFD, would be sufficient;
>>> there's no need to even attempt font matching as though they
>>> were actual characters.
>>
>> So Gecko doesn't do U+FFFD, but I would prefer it if we did.
>> In particular, if the rendering subsystem would only operate
>> on Unicode scalar values and treat lone surrogates as errors,
>> I think that'd be an improvement.
>
> This is a tricky issue, IMO. What would it mean for the
> rendering subsystem to "treat lone surrogates as errors",
> exactly? We don't want the presence of a lone surrogate to
> cause the rendering system to bail and refuse to render the
> remainder of the text run, for example. Nor do we want the
> lone surrogate to be completely ignored; its presence in the
> data is liable to interfere with other processes, so it's
> useful for the user to be able to see that there's *something*
> there.
>
> Rendering as U+FFFD might be an option, but IMO rendering as a
> hexbox is actually better. Note that because JS can manipulate
> the text in terms of UTF-16 code units (NOT characters), it is
> possible for it to "reassemble" two separate isolated
> surrogates into a single valid Unicode character; so the fact
> that the isolated surrogates still retain their distinct
> identities, rather than all appearing as U+FFFD glyphs, makes
> it easier to understand what is happening in such cases. If
> all isolated surrogates are rendered indistinguishably, then
> the behavior whereby bringing two of them into contact
> "magically" produces a valid Unicode character - but which
> particular one is impossible to predict from what was
> displayed - seems far more mysterious.
>...
> However, all this is straying rather far from the specific
> issue of unicode-range, for which I suggest that surrogate
> codepoints are simply irrelevant, as they should not go
> through font-matching as individual codepoints at all.
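To make the "reassembly" Jonathan describes concrete, here is a
minimal sketch in JavaScript/TypeScript terms of the code-unit
behavior in question, using only standard string APIs:

```ts
// Two lone surrogates created independently via code-unit APIs.
// Neither is a valid Unicode scalar value on its own.
const hi = String.fromCharCode(0xd83d); // high (leading) surrogate
const lo = String.fromCharCode(0xde00); // low (trailing) surrogate

console.log(hi.codePointAt(0)?.toString(16)); // "d83d" - still a lone surrogate
console.log(lo.codePointAt(0)?.toString(16)); // "de00"

// Concatenating the two code units "reassembles" one valid character:
const joined = hi + lo;
console.log(joined.codePointAt(0)?.toString(16)); // "1f600" (U+1F600)
console.log([...joined].length);                  // 1 code point, 2 code units
```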
Yes, but while everyone who does a UTF-16 to code point
conversion is supposed to know about surrogate decoding, and
probably does, I'm less confident about UTF-32, where, I believe,
conversions to code points tend to be naive. Surrogate
introducers should never appear in UTF-32, nor should we
encounter them in UTF-8, and, incidentally, we shouldn't
encounter non-shortest-form UTF-8 either. The problem is what to
do about them if they do show up and, as we discussed on the I18n
call yesterday, whether hiding that question behind "valid" is
helpful.
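As a concrete illustration of the surrogate decoding in question,
a minimal sketch of a UTF-16-to-code-point walk that pairs
well-formed surrogates and substitutes U+FFFD for any lone one
(one possible policy, in line with Anne's preference above):

```ts
// Sketch: convert a JS string (UTF-16 code units) to Unicode scalar
// values, treating any unpaired surrogate as an error and replacing
// it with U+FFFD rather than passing it through.
function toScalarValues(s: string): number[] {
  const out: number[] = [];
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {            // high surrogate
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next >= 0xdc00 && next <= 0xdfff) {          // well-formed pair
        out.push(0x10000 + ((unit - 0xd800) << 10) + (next - 0xdc00));
        i++;
        continue;
      }
      out.push(0xfffd);                                // lone high surrogate
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      out.push(0xfffd);                                // lone low surrogate
    } else {
      out.push(unit);                                  // BMP scalar value
    }
  }
  return out;
}

toScalarValues("a\uD800b"); // [0x61, 0xFFFD, 0x62]
```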
To generalize this a bit, there is the question of how to
"render" an unassigned code point and whether, when strings
delivered in some encoding are converted into code points rather
than treated in their original encoded form, the "rendering" of
an unassigned code point should be the same as the display of
surrogate introducers.
I suggest that the discussion and issues could be clarified by
your inventing some terminology to better distinguish at least
some of the following cases from each other:
* Invalid Unicode (e.g., U+11FFFF)
* Invalid encoding (e.g., surrogate introducers in UTF-8
or UTF-32)
* Unassigned code points (there is, by definition, no
correct rendering for these, only the possibility of
substituting an indicator rather than refusing to map
the display into graphics at all)
* Rendering substitutions, including substituting
near-match glyphs for the distinct ones implied by the
code point sequence and substituting alternate type
styles with richer font sets in one or more glyph
positions (or perhaps those two cases should be further
subdivided)
* Strings that could be rendered the way the entity that
constructed the string intended (i.e., all of the needed
glyphs or graphemes are present in the font) but for
which the application lacks (or might lack) sufficient
rendering capability (e.g., half characters and
position-dependent forms that are not distinguished in
the encoding may cause this issue)
* Strings that can and will be rendered as the
string-creator anticipated.
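As a rough sketch of how the first few of these categories might
be told apart for a single code point (the "unassigned" case
genuinely needs Unicode Character Database data, which is only
hinted at here):

```ts
// Rough classification sketch; not a normative definition.
type CodePointClass =
  | "invalid"            // outside U+0000..U+10FFFF: not Unicode at all
  | "surrogate"          // U+D800..U+DFFF: never valid as a scalar value
  | "needs-ucd-lookup";  // in range; assigned vs. unassigned needs UCD data

function classify(cp: number): CodePointClass {
  if (cp < 0 || cp > 0x10ffff) return "invalid";        // e.g. U+11FFFF
  if (cp >= 0xd800 && cp <= 0xdfff) return "surrogate";
  return "needs-ucd-lookup";
}

classify(0x11ffff); // "invalid"
classify(0xd800);   // "surrogate"
classify(0x0378);   // "needs-ucd-lookup" (unassigned, per current UCD)
```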
I'd like to see us get rid of "valid" entirely because, as we
have seen, it is just too subject to different informal
interpretations by different people. But the categories after
the first two or three all have to do with whether a string can
be rendered satisfactorily (and, obviously, what
"satisfactorily" means in context) and not with "validity".
It is a somewhat different issue but, coming back to Jonathan's
comment quoted above, I think one can make the case that a
string that contains sufficiently invalid Unicode is an invalid
string and that all of it should be shown as hexboxes or
equivalent because the application cannot really know what was
intended and telling the user that may be better than guessing.
By contrast, if an individual code point sequence is completely
understood but cannot be rendered as intended, substitutions of
hexboxes or other warning symbols for it may be reasonable, even
as part of a longer string.
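One possible reading of that distinction, sketched with a
hypothetical canRenderAsIntended check standing in for whatever
font-matching test a real renderer would apply:

```ts
// Sketch of the policy above: sufficiently invalid content taints the
// whole string, while mere rendering gaps are handled per code point.
// `canRenderAsIntended` is hypothetical, standing in for font matching.
function displayPolicy(
  codePoints: number[],
  canRenderAsIntended: (cp: number) => boolean
): "hexbox-whole-string" | { perCodePointFallback: number[] } {
  const invalid = codePoints.some(
    (cp) => cp < 0 || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff)
  );
  if (invalid) return "hexbox-whole-string";
  // Otherwise substitute a warning code point only where rendering
  // would fall short of what the string's creator intended.
  return {
    perCodePointFallback: codePoints.map((cp) =>
      canRenderAsIntended(cp) ? cp : 0xfffd
    ),
  };
}
```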
best,
john
Received on Friday, 13 September 2013 11:27:44 UTC