Re: [css3-text] tweak the definition of a grapheme cluster a bit for UTF-16

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Tue, 17 Jan 2012 11:51:04 +0900
Message-ID: <4F14E218.6080302@it.aoyama.ac.jp>
To: "Kang-Hao (Kenny) Lu" <kennyluck@csail.mit.edu>
CC: WWW Style <www-style@w3.org>, WWW International <www-international@w3.org>

Hello Kenny,

On 2012/01/16 20:36, Kang-Hao (Kenny) Lu wrote:
> Conceptually, UAX#29, on which the definition of a grapheme cluster in
> CSS3 Text relies, operates on a string of Unicode code points,
> while the DOM is in reality UTF-16. Although it is quite obvious what
> conversion should happen, it might be nice to say a little bit about
> this. A normative result of this clarification would be to ask UAs to
> render a single emphasis dot instead of two in the following case:
>
> <span style="text-emphasis: dots">(U+D840, U+DC87)</span>  (a random
> ideograph outside the BMP)
>
> As the HTML spec defines the term "Unicode code point"[1]
>
> [[
> The term Unicode code point means a Unicode scalar value where possible,
> and an isolated surrogate code point when not. When a conformance
> requirement is defined in terms of characters or Unicode code points, a
> pair of code units consisting of a high surrogate followed by a low
> surrogate must be treated as the single code point represented by the
> surrogate pair, but isolated surrogates must each be treated as the
> single code point with the value of the surrogate.
> ]]
>
> , I think CSS3 Text can adopt this prose somewhere in the spec, perhaps
> near the definition of a grapheme cluster, and make it undefined as to
> what should happen if isolated surrogates are encountered. See [2] for
> such an example.

My guess is that the HTML spec came up with a special term (it should be 
"UTF-16 code unit", rather than "Unicode code point", but that's a 
separate issue) because in many cases, they define their algorithms and 
procedures in a very low-level fashion.

I don't think this is the case currently for CSS, and I don't think 
there is any need to go there for CSS. It would therefore be much better 
to say clearly in the CSS spec that it deals with Unicode scalar values, 
and leave surrogates as an implementation matter.
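
To make the distinction concrete, here is a minimal sketch of the pairing
rule from the HTML prose quoted above, assuming the input is a list of
UTF-16 code units given as integers. The function names code_points and
is_scalar_value are hypothetical and chosen only for illustration; they
are not taken from any spec.

```python
def code_points(units):
    """Turn a list of UTF-16 code units into a list of code points.

    A high surrogate followed by a low surrogate becomes one code point;
    an isolated surrogate is kept as a code point with its own value.
    (Hypothetical helper, for illustration only.)
    """
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Combine the surrogate pair into one supplementary code point.
            low = units[i + 1]
            result.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP code unit or isolated surrogate: passed through unchanged.
            result.append(unit)
            i += 1
    return result


def is_scalar_value(cp):
    """A Unicode scalar value is any code point that is not a surrogate."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)
```

With this sketch, code_points([0xD840, 0xDC87]) gives the single code
point 0x20087, which is a scalar value, so Kenny's example is one
character and would get one emphasis dot. An isolated 0xD840 passes
through unchanged and is not a scalar value, which is exactly the case a
spec written in terms of scalar values would not need to address.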

Regards,    Martin.

> [1]
> http://www.whatwg.org/specs/web-apps/current-work/multipage/infrastructure.html#unicode-code-point
> [2] http://lists.w3.org/Archives/Public/www-style/2012Jan/0556
>
> Cheers,
> Kenny
>
>
Received on Tuesday, 17 January 2012 02:51:42 GMT
