
[css3-text] tweak the definition of a grapheme cluster a bit for UTF-16

From: Kang-Hao (Kenny) Lu <kennyluck@csail.mit.edu>
Date: Mon, 16 Jan 2012 19:36:22 +0800
Message-ID: <4F140BB6.5030703@csail.mit.edu>
To: WWW Style <www-style@w3.org>, WWW International <www-international@w3.org>
Conceptually, UAX#29, on which the definition of a grapheme cluster in
CSS3 Text relies, operates on a string of Unicode code points, while the
DOM in reality exposes text as UTF-16 code units. Although it is quite
obvious what conversion should happen, it might be nice to say a little
bit about this. A normative result of this clarification would be to
require UAs to render a single emphasis dot instead of two in the
following case:

<span style="text-emphasis: dot">(U+D840, U+DC87)</span> (a random
ideograph outside the BMP)

As the HTML spec defines the term "Unicode code point"[1]

[[
The term Unicode code point means a Unicode scalar value where possible,
and an isolated surrogate code point when not. When a conformance
requirement is defined in terms of characters or Unicode code points, a
pair of code units consisting of a high surrogate followed by a low
surrogate must be treated as the single code point represented by the
surrogate pair, but isolated surrogates must each be treated as the
single code point with the value of the surrogate.
]]

, I think CSS3 Text can adopt this prose somewhere in the spec, perhaps
near the definition of a grapheme cluster, and leave undefined what
should happen when isolated surrogates are encountered. See [2] for
such an example.
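
To make the conversion concrete, here is a minimal sketch in Python of
the rule quoted above. The function name code_points is hypothetical,
and the input is assumed to be a plain list of 16-bit integers standing
in for the DOM's code units. It covers only the step from code units to
code points, not the UAX#29 segmentation that would run afterwards.

```python
def code_points(units):
    """Convert a sequence of UTF-16 code units into Unicode code points.

    A high surrogate followed by a low surrogate becomes one code point;
    an isolated surrogate is passed through with its own value.
    """
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Combine the pair into a single supplementary-plane code point.
            low = units[i + 1]
            result.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP code unit or isolated surrogate: keep it as is.
            result.append(unit)
            i += 1
    return result


paired = code_points([0xD840, 0xDC87])    # the pair from the example above
isolated = code_points([0xDC87, 0xD840])  # reversed order, so no valid pair

print([f"U+{cp:04X}" for cp in paired])    # ['U+20087']
print([f"U+{cp:04X}" for cp in isolated])  # ['U+DC87', 'U+D840']
```

With the pair combined, the span holds one code point (U+20087) and so
one grapheme cluster, which is why a single emphasis dot is expected.
The second call shows the isolated-surrogate case, for which the
rendering would stay undefined under this proposal.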

[1]
http://www.whatwg.org/specs/web-apps/current-work/multipage/infrastructure.html#unicode-code-point
[2] http://lists.w3.org/Archives/Public/www-style/2012Jan/0556

Cheers,
Kenny
Received on Monday, 16 January 2012 11:37:37 GMT
