Re: [css3-text] tweak the definition of a grapheme cluster a bit for UTF-16 from Jonathan Kew on 2012-01-17 (www-style@w3.org from January 2012)

From: Jonathan Kew <jonathan@jfkew.plus.com>
Date: Tue, 17 Jan 2012 09:17:02 +0000
To: www-style Style <www-style@w3.org>
Message-Id: <1068A7A5-43F4-4BD0-B458-4F06D12F10C3@jfkew.plus.com>

On 17 Jan 2012, at 02:45, fantasai wrote:
>> If CSS were in pure Unicode, which would imply that the document tree
>> (the terminology used in CSS2.1) is in pure Unicode, then we wouldn't
>> have been presented with questions like what a UA should do if a non-BMP
>> character crosses the element boundary, as being discussed by Boris and
>> Glenn. HTML is unlikely to be the layer to address this problem either (how
>> would HTML+DOM give CSS a document tree in pure Unicode?)
> 
> Seems to me that would fall under the "grapheme cluster split by an
> element boundary" case, no?

No, I don't think so - "grapheme cluster [should not be] split by an element boundary" is a much more sweeping restriction (and one that I'm not keen to see imposed) than requiring that a "non-BMP character [in the sense of a single encoded character U+XXXXXX from the Unicode character set]" should not be split. Don't conflate these cases.

The latter is clearly (IMO) a reasonable restriction; in an ideal world, Unicode characters would be indivisible entities and the question simply wouldn't arise, and the internal representation used in any particular UA implementation would not be exposed to users at all. Splitting a Unicode character across an element boundary _presumes_ a particular encoding form, and the result cannot in general be converted between encoding forms without loss - a good indication that it represents a fundamental misuse of Unicode at a structural level. Ideally, it would be impossible for authors to create such data; unfortunately, JavaScript exposes text as UTF-16 and lets authors munge it at the level of individual code units (rather than characters).
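
To make the "cannot be converted without loss" point concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the character U+1D11E and the helper names `decodes_as_utf16` and `encodes_as_utf8` are my own choices, and the UTF-16 code units are modelled explicitly as bytes because Python strings are indexed by code point, not by code unit.

```python
# Illustrative only: models UTF-16 code units explicitly as big-endian bytes.
ch = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, a single non-BMP character

units = ch.encode("utf-16-be")  # two 16-bit code units (a surrogate pair)
first_half, second_half = units[:2], units[2:]


def decodes_as_utf16(data: bytes) -> bool:
    """Return True if data is well-formed big-endian UTF-16."""
    try:
        data.decode("utf-16-be")
        return True
    except UnicodeDecodeError:
        return False


def encodes_as_utf8(text: str) -> bool:
    """Return True if text can be expressed in UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# The whole character is well-formed; neither half is on its own.
whole_ok = decodes_as_utf16(units)
halves_ok = [decodes_as_utf16(first_half), decodes_as_utf16(second_half)]

# A lone surrogate (one half of the pair) has no UTF-8 representation,
# so the split text cannot be carried over to another encoding form.
lone_surrogate = "\ud834"
lone_utf8_ok = encodes_as_utf8(lone_surrogate)
```

Each half of the pair is meaningful only as a UTF-16 artifact, which is why a split at that point does not survive re-encoding.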

Splitting a grapheme cluster into its individual Unicode characters, on the other hand, is entirely legitimate and does not conflict with Unicode principles - although it's true that once you decide to style the individual components of the cluster in different ways, it may become difficult to determine what the desired rendering ought to be, let alone implement it. As such, it's reasonable to acknowledge that there may be implementation differences in the level of support for this, but it should not be lumped in the same (strongly discouraged, we-wish-we-could-utterly-prevent-it) category as splitting a single character [in a particular encoding form] across a boundary.
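
By contrast, a rough sketch of the grapheme cluster case (again Python, with an example cluster and a helper name `round_trips` that are purely my own assumptions): each piece of the split cluster is a complete Unicode character and survives conversion between encoding forms.

```python
# Illustrative only: "e" followed by U+0301 COMBINING ACUTE ACCENT forms
# one grapheme cluster made of two Unicode characters.
cluster = "e\u0301"

# Python iterates a str by code point, so this splits at character level.
parts = list(cluster)


def round_trips(text: str) -> bool:
    """Return True if text survives UTF-8, UTF-16 and UTF-32 unchanged."""
    encodings = ("utf-8", "utf-16-be", "utf-32-be")
    return all(text.encode(enc).decode(enc) == text for enc in encodings)


# Every component is valid in every encoding form, and rejoining the
# components reproduces the original cluster exactly.
parts_ok = all(round_trips(p) for p in parts)
rejoined = "".join(parts)
```

How the separately styled pieces ought to be rendered is a different (and harder) question, but nothing here depends on a particular encoding form.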

JK

Received on Tuesday, 17 January 2012 09:17:40 UTC