
Re: [css3-text] tweak the definition of a grapheme cluster a bit for UTF-16

From: Kang-Hao (Kenny) Lu <kennyluck@csail.mit.edu>
Date: Tue, 17 Jan 2012 08:49:25 +0800
Message-ID: <4F14C595.10006@csail.mit.edu>
To: fantasai <fantasai.lists@inkedblade.net>
CC: WWW Style <www-style@w3.org>
Practically speaking, there are two interoperability-related issues that
apply to browsers here:
1. UAs using UTF-16 as internal storage treat a non-BMP character as two
grapheme clusters. I am aware that this is unlikely to happen, so I'll
stop talking about this possibility.
2. UAs render content with isolated surrogates differently. This has
already happened[1]. If you find other ways to address this problem (by
either marking it as undefined or forbidding certain behavior) then I
think I am satisfied. That is, I don't want WebKit's behavior to fall
under the "UA may further tailor the definition (grapheme cluster) as
allowed by Unicode." allowance. UAs should not be allowed to count an
element starting with an isolated surrogate as having zero grapheme
clusters, so to speak.
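
To make the two issues above concrete, here is a minimal Python sketch of
what "a non-BMP character in UTF-16" and "an isolated surrogate" mean at
the code-unit level. The helper names utf16_units and isolated_surrogates
are made up for this illustration; nothing here is taken from any UA or
from the specs under discussion.

```python
from typing import List


def utf16_units(text: str) -> List[int]:
    """Return the UTF-16 code units of text (lone surrogates pass through)."""
    data = text.encode("utf-16-le", errors="surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def isolated_surrogates(units: List[int]) -> List[int]:
    """Return the indexes of surrogate code units that are not in a valid pair."""
    isolated = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate is only valid when a low surrogate follows it.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            isolated.append(i)
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            isolated.append(i)
        i += 1
    return isolated
```

For example, utf16_units("\U00020000") yields the two code units 0xD840
and 0xDC00, which together stand for one code point and should count as
one grapheme cluster (issue 1). A sequence such as [0xD840, 0x0041] has an
isolated surrogate at index 0, which is the kind of content issue 2 is
about.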

The following is theoretical; you can treat it as requesting no action
and ignore it.

(12/01/17 7:16), fantasai wrote:
> On 01/16/2012 03:36 AM, Kang-Hao (Kenny) Lu wrote:
>> Conceptually, UAX#29, on which the definition of a grapheme cluster in
>> CSS3 Text relies upon, operates on a string of Unicode code points,
>> while the DOM is in reality UTF-16. Although it is quite obvious what
>> conversion should happen, it might be nice to say a little bit about
>> this. A normative result from this clarification would be to ask UA to
>> render a single emphasis dot instead of two in the following case
> I feel like this should be covered by Unicode already, are you saying
> it's not?
> We assume Unicode in CSS 

First of all, can you point me to the spec that has a statement like
this? I couldn't find such a statement in either CSS2.1 or CSS3 Text.

If CSS were in pure Unicode, which would imply that the document tree
(the term used in CSS2.1) is in pure Unicode, then we wouldn't be facing
questions like what a UA should do if a non-BMP character crosses an
element boundary, as discussed by Boris and Glenn. HTML is unlikely to
be the layer that addresses this problem either (how would HTML+DOM give
CSS a document tree in pure Unicode?)
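
As a rough illustration of the element-boundary question, the following
Python sketch splits a run of UTF-16 code units in the middle of a
surrogate pair and converts each half to code points separately. The
function name to_code_points is hypothetical, and mapping an isolated
surrogate to U+FFFD is only one possible treatment, chosen here as an
assumption; it is not something the specs in this thread mandate.

```python
from typing import List


def to_code_points(units: List[int]) -> List[int]:
    """Convert UTF-16 code units to code points.

    Isolated surrogates become U+FFFD (an assumption made for this sketch).
    """
    out = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Valid pair: combine into one supplementary code point.
            out.append(0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        elif 0xD800 <= unit <= 0xDFFF:
            out.append(0xFFFD)
            i += 1
        else:
            out.append(unit)
            i += 1
    return out


# "A", U+20000 as a surrogate pair, "B"
whole = [0x0041, 0xD840, 0xDC00, 0x0042]
# A hypothetical element boundary that falls between the two surrogates.
left, right = whole[:2], whole[2:]
```

Converted as a whole, the run gives [0x41, 0x20000, 0x42]. Converted per
element, the left half gives [0x41, 0xFFFD] and the right half gives
[0xFFFD, 0x42], so the non-BMP character is lost unless something looks
across the boundary.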

> so that we can talk about particular rendering
> requirements, but we don't say anything about the encoding: that's up to
> the UA. It can store things as UCS-32 if it wants to, use SHIFT-JIS
> internally, or only support ASCII documents. Doesn't matter.

Yeah, I kind of agree we could make the CSS specs as encoding-independent
as possible. I guess we can start a "CSS for UTF-16 UAs" module 10 years
from now, if we finally want to standardize Gecko's behavior on non-BMP
characters crossing element boundaries :p .

[1] http://lists.w3.org/Archives/Public/www-style/2012Jan/0556

Received on Tuesday, 17 January 2012 00:49:57 UTC
