Re: [css-fonts-3] i18n-ISSUE-296: Usable characters in unicode-range

From: Anne van Kesteren <annevk@annevk.nl>
Date: Fri, 13 Sep 2013 12:22:51 +0100
Message-ID: <CADnb78gqa_uD5SHFV6rpp6jYxBQmTiaqreLx-zOf-sVjsZ+8DQ@mail.gmail.com>
To: Jonathan Kew <jfkthame@googlemail.com>
Cc: John Daggett <jdaggett@mozilla.com>, Addison Phillips <addison@lab126.com>, Richard Ishida <ishida@w3.org>, W3C Style <www-style@w3.org>, www International <www-international@w3.org>
On Fri, Sep 13, 2013 at 11:33 AM, Jonathan Kew <jfkthame@googlemail.com> wrote:
> This is a tricky issue, IMO. What would it mean for the rendering subsystem
> to "treat lone surrogates as errors", exactly?

Basically, to treat them as if U+FFFD were passed. That's how we deal
with them in the encoding layer and in character references and such.


> We don't want the presence of
> a lone surrogate to cause the rendering system to bail and refuse to render
> the remainder of the text run, for example. Nor do we want the lone
> surrogate to be completely ignored; its presence in the data is liable to
> interfere with other processes, so it's useful for the user to be able to
> see that there's *something* there.

Agreed.


> Rendering as U+FFFD might be an option, but IMO rendering as a hexbox is
> actually better. Note that because JS can manipulate the text in terms of
> UTF-16 code units (NOT characters), it is possible for it to "reassemble"
> two separate isolated surrogates into a single valid Unicode character; so
> the fact that the isolated surrogates still retain their distinct
> identities, rather than all appearing as U+FFFD glyphs, makes it easier to
> understand what is happening in such cases. If all isolated surrogates are
> rendered indistinguishably, then the behavior whereby bringing two of them
> into contact "magically" produces a valid Unicode character - but which
> particular one is impossible to predict from what was displayed - seems far
> more mysterious.

I guess my point of view is that I'd rather not have 16-bit code units
leak through to places that could do without. It's a fair argument
though. I guess the flipside would be to embrace the 16-bit code unit
nature of the web and just define everything in terms of that.


> However, all this is straying rather far from the specific issue of
> unicode-range, for which I suggest that surrogate codepoints are simply
> irrelevant, as they should not go through font-matching as individual
> codepoints at all.

Well, if you argue we want to render lone surrogates, I would argue it
makes sense to design a different font for them too. I'm not entirely
convinced we want to render them though.


-- 
http://annevankesteren.nl/
Received on Friday, 13 September 2013 11:23:19 UTC
