Re: [css3-text] tweak the definition of a grapheme cluster a bit for UTF-16 from Boris Zbarsky on 2012-01-17 (www-style@w3.org from January 2012)

From: Boris Zbarsky <bzbarsky@MIT.EDU>
Date: Mon, 16 Jan 2012 20:34:27 -0500
To: "Kang-Hao (Kenny) Lu" <kennyluck@csail.mit.edu>
CC: fantasai <fantasai.lists@inkedblade.net>, WWW Style <www-style@w3.org>
Message-ID: <4F14D023.6040105@mit.edu>

On 1/16/12 7:49 PM, Kang-Hao (Kenny) Lu wrote:
> Practically speaking, there are two interoperability-related issues that
> apply to browsers here:
> 1. UAs using UTF-16 as internal storage treat a non-BMP character as two
> grapheme clusters. I am aware that this is unlikely to happen so I'll
> stop talking about this possibility.
> 2. UAs render content with isolated surrogate differently.

Are there only two?  I guess so when talking about UTF-16 related issues in 
particular.

Again, the behavior of Gecko for surrogates just falls out from the 
general approach to text rendering: A consecutive run of text that all 
has the "same" style (whatever that means; that's another fun 
discussion), even if it spans different elements, is treated as a single 
unit for purposes of text rendering.  That means that things like 
handling of composing characters, ligatures, shaping, etc. happen on it 
all as a unit.
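
As a rough illustration of the "single unit" idea, here is a minimal 
Python sketch.  It is not Gecko's code.  The (style, text) span list and 
the helper names are hypothetical, and it assumes that font size is the 
only property that breaks a run, so a color change alone does not:

```python
from itertools import groupby

# Hypothetical input: one (style, text) pair per element, in document order.
spans = [
    ({"font-size": 40, "color": "green"}, "\u0628"),
    ({"font-size": 40, "color": "purple"}, "\u062A"),
    ({"font-size": 41, "color": "purple"}, "\u0628"),
]

def shaping_key(style):
    # Assumption: only font size matters for deciding whether two
    # adjacent spans have the "same" style; color is ignored.
    return style["font-size"]

def text_runs(spans):
    # Merge adjacent spans with an equal shaping key into one string, so
    # shaping, ligatures, etc. could be applied to each run as a unit.
    return [
        "".join(text for _, text in group)
        for _, group in groupby(spans, key=lambda s: shaping_key(s[0]))
    ]

runs = text_runs(spans)
```

With this input the first two spans end up in one run despite the 
element boundary and color change, and the 41px span starts a new run.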

Compare the behavior of this testcase in different browsers (and pardon 
the probably-nonsense text):

<!DOCTYPE html>
<body style="font-size: 40px">
   &#x628; &#x62A;<br>
   &#x628;&#x62A;<br>
   <span>&#x628;</span><span>&#x62A;</span><br>
   <span style="color: green">&#x628;</span><span style="color: purple">&#x62A;</span><br>
   <span style="font-size: 41px">&#x628;</span>&#x62A;

In Gecko and Trident I see shaping happen for all but the first and last 
lines of text.  In the first line it should obviously not happen; in the 
last line Gecko doesn't do it because it's not really clear how to shape 
two glyphs from different font sizes.  I can't speak for Trident there, 
though I bet the causes for its behavior are similar.

In WebKit and Presto, only the second line of text is shaped over here.  
I would argue that's wrong, especially for the third line of text.

> That is, I don't want WebKit's behavior to fall into the "UA
> may further tailor the definition (grapheme cluster) as allowed by
> Unicode."

I'm not sure that would cover shaping anyway, or would it?

> Yeah, I kind of agree we could make the CSS specs as encoding irrelevant
> as possible. I guess we can start a CSS for UTF-16 UA module 10 years
> later, if we finally want to standardize Gecko's behavior on non-BMP
> characters crossing element boundary :p .

Handling the non-BMP case explicitly would be nice, but we have other 
non-interop across element boundaries too.  It might turn out, as in 
Gecko's case, that simply trying to solve those other use cases ends up 
Just Working for the common non-BMP cases....
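
To sketch why that could Just Work, here is a small Python example.  It 
is only an illustration under assumptions: the helper names are 
hypothetical, U+10400 is an arbitrary non-BMP character, and the two 
one-unit lists stand in for adjacent elements that share a style:

```python
import struct

def utf16_units(s):
    # Split a string into its UTF-16 code units (big-endian, no BOM).
    data = s.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

def decode_units(units):
    # Decode UTF-16 code units; an isolated surrogate cannot be decoded
    # and is replaced rather than raising an error.
    data = struct.pack(">%dH" % len(units), *units)
    return data.decode("utf-16-be", errors="replace")

ch = "\U00010400"              # a non-BMP character
high, low = utf16_units(ch)    # one surrogate pair: 0xD801, 0xDC00

# Each hypothetical element holds one half of the pair.
first_element, second_element = [high], [low]

# Handled separately, each element contains only an isolated surrogate.
separate = decode_units(first_element) + decode_units(second_element)

# Handled as one run, the pair is rejoined into the original character.
joined = decode_units(first_element + second_element)
```

Here `joined` is the original character again, while `separate` is not.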

-Boris

Received on Tuesday, 17 January 2012 01:40:21 UTC