Re: [css3-fonts] unicode-range and unicode normalization from John Daggett on 2010-07-12 (www-style@w3.org from July 2010)

From: John Daggett <jdaggett@mozilla.com>
Date: Sun, 11 Jul 2010 20:43:02 -0700 (PDT)
To: Yuzo Fujishima <yuzo@google.com>
Cc: www-style@w3.org, www-font <www-font@w3.org>
Message-ID: <2101956867.58251.1278906181941.JavaMail.root@cm-mail03.mozilla.org>

Yuzo Fujishima wrote:

> What unicode normalization (http://unicode.org/reports/tr15/) must be
> applied to the characters in an HTML document before matching against
> the unicode-range descriptor
> (http://dev.w3.org/csswg/css3-fonts/#unicode-range-desc)?
> 
> A. No normalization at all. All the codepoints are checked against unicode-range as-is.
> B. Undefined. Whether to apply normalization is up to UA.
> C. Must be normalized to NFC
> D. Must be normalized to NFD
> E. Must be normalized to NFKC
> F. Must be normalized to NFKD
>
> In my opinion, A (or B) is the most realistic choice, seeing that
> Chrome 6, Safari 6, IE8, and Opera 10 don't normalize stylesheets.
> (Firefox 6 doesn't seem to be working in this respect.)
> http://www.w3.org/International/tests/tests-html-css/tests-normalization/generate?test=10&serveas=xml&format=xhtml5

The short answer is probably (C): strings should be NFC-normalized
before the font selection process is run, with some caveats listed
below.

The underlying question here is whether normalization is applied to a
character stream before the font selection algorithm is run;
unicode-range is just one part of that process.  That's actually
independent of whether stylesheet data is normalized or not.  The font
selection process maps content character streams to font character
maps, so there aren't the same string equivalence issues.
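
As a rough sketch of the difference between (A) and (C), the snippet
below tests each code point of a decomposed "Håkon" against a range,
first as-is and then after NFC normalization.  It assumes Python's
standard unicodedata module; the range U+0000-00FF and the helper names
are made up for illustration and are not taken from any spec text.

```python
import unicodedata

# Hypothetical unicode-range for illustration: U+0000-00FF (Latin-1).
LATIN1_RANGES = [(0x0000, 0x00FF)]

def in_unicode_range(code_point, ranges):
    """True if the code point falls inside any (low, high) interval."""
    return any(low <= code_point <= high for low, high in ranges)

def coverage(text, ranges, form=None):
    """Per-code-point range test, optionally after normalizing the text."""
    if form is not None:
        text = unicodedata.normalize(form, text)
    return [in_unicode_range(ord(ch), ranges) for ch in text]

# "Håkon" written as 'a' followed by U+030A COMBINING RING ABOVE.
decomposed = "Ha\u030akon"

print(coverage(decomposed, LATIN1_RANGES))         # code points as-is
print(coverage(decomposed, LATIN1_RANGES, "NFC"))  # NFC applied first
```

As-is, the combining ring (U+030A) falls outside the range, so one of
the six code points fails the test.  After NFC the a-ring becomes the
single code point U+00E5 and all five code points are in range.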

The font matching algorithm in CSS has always been described in
relation to "characters"; precisely how combining characters affect
font fallback is unspecified.  Fonts can support combined
(precomposed) forms and combining forms, or just one and not the
other.  For example, a font can have a glyph for 'a-ring' along with
glyphs for 'a' and 'combining ring', so there are multiple ways to
select appropriate glyphs for "Håkon".

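To make the "Håkon" example concrete, here is a small sketch, again
assuming Python's unicodedata module, that builds both encodings of the
name and checks them against a hypothetical cmap containing 'a-ring',
'a' and 'combining ring'.  The cmap contents and function names are
invented for illustration.

```python
import unicodedata

precomposed = "H\u00e5kon"                              # a-ring as U+00E5
decomposed = unicodedata.normalize("NFD", precomposed)  # 'a' + U+030A

# Hypothetical cmap of a font that has glyphs for both forms.
cmap = {ord(ch) for ch in "Hakon"} | {0x00E5, 0x030A}

def code_points(text):
    """List the code points of the text in U+XXXX notation."""
    return ["U+%04X" % ord(ch) for ch in text]

def fully_mapped(text, cmap):
    """True if every code point of the text has an entry in the cmap."""
    return all(ord(ch) in cmap for ch in text)

for label, text in (("precomposed", precomposed), ("decomposed", decomposed)):
    print(label, code_points(text), fully_mapped(text, cmap))
```

Both sequences are canonically equivalent and both map fully into this
font, yet they select different glyphs: one glyph for U+00E5 versus two
glyphs for U+0061 and U+030A.
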
So the answer to your question isn't quite as simple as specifying a
given normalization.  If a glyph for the combined codepoint exists in
the font, using that glyph is probably best.  Otherwise, ideally the
base character and combining character should come from the same
font, since that assures correct placement of the combining character.
In the case where the combined character is not included in the cmap
but both the base character and combining character are included, I
don't think it makes sense to try to do font matching on the
decomposition of base character + combining character.  I think you'd
end up testing for situations that rarely exist and for which the
results would not be guaranteed to be correct anyway.
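
The preference order above could be sketched roughly as follows.  This
is only an illustration under stated assumptions: fonts are modelled as
(name, set of code points) pairs, clusters are split with
unicodedata.combining(), and the second step (taking base and combining
character from one font) is just one possible UA choice, not something
proposed here as a required matching rule.  All names are hypothetical.

```python
import unicodedata

def clusters(text):
    """Group each base character with the combining marks that follow it."""
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out

def pick_font(cluster, fonts):
    """Return (font name, code points to render) for one cluster.

    fonts is an ordered list of (name, set_of_code_points) pairs.
    """
    composed = unicodedata.normalize("NFC", cluster)
    # 1. Prefer a font that has a glyph for the combined code point.
    if len(composed) == 1:
        for name, cmap in fonts:
            if ord(composed) in cmap:
                return name, composed
    # 2. Assumed UA choice: base and marks taken from one and the same font.
    for name, cmap in fonts:
        if all(ord(ch) in cmap for ch in cluster):
            return name, cluster
    # 3. Nothing suitable; what happens next is left to the UA.
    return None, cluster

# Hypothetical fonts: A has 'a' and the combining ring, B has only a-ring.
font_a = ("A", {ord(ch) for ch in "Hakon"} | {0x030A})
font_b = ("B", {0x00E5})

for cluster in clusters("Ha\u030akon"):
    print(repr(cluster), pick_font(cluster, [font_a, font_b]))
```

With both fonts available the 'a' + ring cluster is rendered from font
B using U+00E5; with only font A in the list it stays decomposed and
both code points come from font A.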

So I think we can specify common cases that should match, but there
are some cases where it might be better left to UAs to deal with
appropriately.  I'm cc'ing the fonts list in case anyone there feels
otherwise.

Regards,

John Daggett

Received on Monday, 12 July 2010 03:43:36 UTC