[css3-text] text-transform misc issues (was RE: [css3-writing-modes] referring to Unicode) from Koji Ishii on 2011-05-08 (www-international@w3.org from April to June 2011)

From: Koji Ishii <kojiishi@gluesoft.co.jp>
Date: Sun, 8 May 2011 09:05:40 -0400
To: John Daggett <jdaggett@mozilla.com>, Addison Phillips <addison@lab126.com>
CC: fantasai <fantasai.lists@inkedblade.net>, "www-style@w3.org" <www-style@w3.org>, WWW International <www-international@w3.org>
Message-ID: <A592E245B36A8949BDB0A302B375FB4E0AC28756FB@MAILR001.mail.lan>

I'm splitting the thread for text-transform.

> http://dev.w3.org/cvsweb/~checkout~/csswg/css3-text/Overview.html?rev=1.128;content-type=text%2Fhtml#text-transform

> 
> The 'fullwidth' value is defined as:
> 
>     Puts all characters in fullwidth form. If the character does not
>     have a corresponding fullwidth form, it is left as is. This value
>     is typically used to typeset Latin characters and digits like
>     ideographic characters.
> 
> Additional description:
> 
>     The definition of fullwidth and halfwidth forms can be found on
>     the Unicode consortium web site at [UAX11]. The mapping to
>     fullwidth form is defined by <wide> tag of Character Decomposition
>     Mapping in [UAX44].
> 
> But this doesn't really define the precise mapping function, it implies
> it obliquely.  The data in the UnicodeData.txt file looks like this:
> 
> FF41;FULLWIDTH LATIN SMALL LETTER A;Ll;0;L;<wide> 0061;;;;N;;;FF21;;FF21
> FF42;FULLWIDTH LATIN SMALL LETTER B;Ll;0;L;<wide> 0062;;;;N;;;FF22;;FF22
> FF43;FULLWIDTH LATIN SMALL LETTER C;Ll;0;L;<wide> 0063;;;;N;;;FF23;;FF23
> FF44;FULLWIDTH LATIN SMALL LETTER D;Ll;0;L;<wide> 0064;;;;N;;;FF24;;FF24
> 
> The mapping is *from* the codepoint contained in the
> Decomposition_Mapping property when '<wide>' is present.  So 'a'
> (U+0061) would map to its fullwidth version (U+FF41).

You're right that the mapping is "from" the codepoint, but that's the definition of the Decomposition_Mapping.

  00B9;SUPERSCRIPT ONE;...;<super> 0031;

means "U+00B9 is <super> of U+0031". I can add a non-normative note on how to interpret the values of the Decomposition_Mapping field, with an example. Do you think that would help?
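
As a rough illustration of that reading (a sketch, not proposed spec text), the snippet below inverts the <wide> entries using Python's standard unicodedata module. The helper name build_fullwidth_map is made up for this example, and the resulting table depends on the Unicode version bundled with the Python interpreter.

```python
import sys
import unicodedata


def build_fullwidth_map():
    """Invert the <wide> decomposition mappings.

    UnicodeData.txt states "U+FF41 is <wide> of U+0061"; the 'fullwidth'
    transform needs the opposite direction, U+0061 -> U+FF41.
    """
    mapping = {}
    for cp in range(sys.maxunicode + 1):
        decomp = unicodedata.decomposition(chr(cp))
        if not decomp.startswith("<wide>"):
            continue
        targets = decomp.split()[1:]  # hex code points after the tag
        if len(targets) == 1:  # keep only one-to-one mappings
            mapping[int(targets[0], 16)] = cp
    return mapping


FULLWIDTH = build_fullwidth_map()

# The <super> example from this message, then a small fullwidth conversion.
print(unicodedata.decomposition("\u00b9"))
print("abc 123".translate(FULLWIDTH))
```

Note that the U+0020 in "abc 123" is converted as well, because U+3000 carries a <wide> mapping to U+0020.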

> When you look at the data you also discover this:
> 
> 3000;IDEOGRAPHIC SPACE;Zs;0;WS;<wide> 0020;;;;N;;;;;
> 
> So the mapping would also map spaces to ideographic spaces.  Since
> this has implications for white space collapsing, the point in the
> text handling pipeline where text-transform occurs needs to be defined
> precisely.  This has been noted as an issue and discussed on www-style
> [1].

You're right. The situation is:
* From use cases, authors want to transform U+0020 to U+3000 after white space processing.
* If it's too hard for implementations, authors can live without it, since the feature is still useful.
* But at this point, we don't know if it's hard for implementations or not.

So at the end of the spec, fantasai and I added this paragraph:

> Text transformation happens after white space processing.
> (This only matters when ‘fullwidth’ transforms U+0020 space
> characters to U+3000.) Issue:This requirement may need to
> be relaxed during CR, so mark at-risk.

Does this solve your concern?
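
To make the ordering question concrete, here is a minimal sketch. It assumes a toy model of white space processing that collapses only runs of U+0020, and a simplified fullwidth table built from the fixed offset between U+0021..U+007E and U+FF01..U+FF5E; the function names are made up for this example.

```python
import re

# Simplified 'fullwidth' table (an assumption for this sketch):
# U+0021..U+007E sit at a fixed offset from U+FF01..U+FF5E,
# and U+0020 maps to U+3000 (IDEOGRAPHIC SPACE).
FULLWIDTH = {cp: cp + 0xFEE0 for cp in range(0x21, 0x7F)}
FULLWIDTH[0x20] = 0x3000


def collapse_spaces(text):
    """Toy white space processing: collapse runs of U+0020 only."""
    return re.sub(" +", " ", text)


def fullwidth(text):
    return text.translate(FULLWIDTH)


source = "a  b"  # two U+0020 between the letters

after_ws = fullwidth(collapse_spaces(source))   # transform after white space processing
before_ws = collapse_spaces(fullwidth(source))  # transform before white space processing

print(len(after_ws), len(before_ws))
```

In this model, transforming after white space processing leaves one ideographic space between the letters, while transforming first leaves two, because the collapsing step no longer sees any U+0020.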

> The precise behavior of 'uppercase' and 'lowercase' should also
> probably be defined explicitly.  Should only the
> Simple_Uppercase_Mapping and Simple_Lowercase_Mapping properties be
> used?  Or should the properties contained in SpecialCasing.txt also
> apply? (My answer: yes please!).

I agree that we should state the properties for these too, and I agree that we should use SpecialCasing.txt as well. Thank you for providing your expected answer beforehand; that really helps. I'll draft wording and consult with fantasai.
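
The difference between the two sources can be sketched as follows. The helper simple_uppercase is hypothetical and reads the Simple_Uppercase_Mapping field (field 12, counting from 0) from a UnicodeData.txt line; Python's built-in str.upper() is used as a stand-in for the full mapping, since it also applies the unconditional one-to-many entries from SpecialCasing.txt.

```python
# The U+FF41 line quoted earlier in this message.
LINE = "FF41;FULLWIDTH LATIN SMALL LETTER A;Ll;0;L;<wide> 0061;;;;N;;;FF21;;FF21"


def simple_uppercase(line):
    """Return the Simple_Uppercase_Mapping for one UnicodeData.txt line.

    An empty field means the character maps to itself.
    """
    fields = line.split(";")
    code_point = int(fields[0], 16)
    upper = fields[12]
    return chr(int(upper, 16)) if upper else chr(code_point)


# Simple mapping: always one code point in, one code point out.
print(simple_uppercase(LINE) == "\uff21")

# Full mapping: the ff ligature (U+FB00) expands to two letters.
ligature = "\ufb00"
print(ligature.upper(), len(ligature.upper()))
```

A simple-only implementation would leave U+FB00 unchanged under 'uppercase', which is the visible difference between the two choices.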

> Instead the current draft just writes:
> 
>     Although limited, the case mapping process has some language
>     dependencies. Some well known examples are Turkish and Greek. If
>     the content language is known then any such language-specific
>     rules must be used.  The case mapping rules for the character
>     repertoire specified by the Unicode Standard can be found on the
>     Unicode Consortium Web site. [UNICODE]
> 
> This is simply not sufficient to define what 'uppercase' and
> 'lowercase' means in implementation terms.

This paragraph is talking about the language dependency of the casing algorithm, so I think it should be kept. It is an additional requirement on top of Simple_Uppercase_Mapping, Simple_Lowercase_Mapping, and SpecialCasing.txt.
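
As an illustration of such a language-specific rule, the sketch below layers the Turkish/Azerbaijani dotted-i tailoring over Python's default str.upper(). The table UPPER_TAILORING and the function uppercase are made up for this example and cover only this one rule; a real implementation would take its tailorings from the conditional entries in SpecialCasing.txt.

```python
# Hypothetical tailoring table: only the dotted-i rule is included.
UPPER_TAILORING = {
    "tr": {ord("i"): "\u0130"},  # i -> LATIN CAPITAL LETTER I WITH DOT ABOVE
    "az": {ord("i"): "\u0130"},
}


def uppercase(text, lang=None):
    """Uppercase text, applying a language tailoring when one is known."""
    primary = (lang or "").split("-")[0].lower()  # "tr-TR" -> "tr"
    tailored = text.translate(UPPER_TAILORING.get(primary, {}))
    return tailored.upper()  # default mapping for everything else


print(uppercase("istanbul", "tr"))
print(uppercase("istanbul", "en"))
```

When the content language is unknown, the function falls back to the default mapping, which matches the "if the content language is known" condition in the quoted paragraph.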

> Depending on how you define the case-mapping properties, there's also
> a possible ordering issue, since text-transform can be multi-valued:
> 
>   p { text-transform: fullwidth lowercase; }
> 
>   <p>&#xfb00;</p> /* codepoint for ff presentational ligature */
> 
> Does a viewer see the ff-ligature or fullwidth FF?  This *might* be
> determined by the order in which these mappings are applied.

Great point. I think this works:
  [ capitalize | uppercase | lowercase ] > fullwidth > fullsize-kana
I'll note this in the spec unless anyone has a different idea.
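
A small sketch of why the order can matter, reading ">" above as "is applied before". It assumes Python's str.upper() and str.lower() as the case mappings and a simplified fullwidth table based on the fixed offset between U+0021..U+007E and U+FF01..U+FF5E; apply_in_order is a made-up helper.

```python
# Simplified 'fullwidth' table (an assumption for this sketch).
FULLWIDTH = {cp: cp + 0xFEE0 for cp in range(0x21, 0x7F)}
FULLWIDTH[0x20] = 0x3000


def fullwidth(text):
    return text.translate(FULLWIDTH)


STEPS = {
    "uppercase": str.upper,
    "lowercase": str.lower,
    "fullwidth": fullwidth,
}


def apply_in_order(text, order):
    """Apply the named transforms one after another."""
    for name in order:
        text = STEPS[name](text)
    return text


ligature = "\ufb00"  # LATIN SMALL LIGATURE FF

case_first = apply_in_order(ligature, ["uppercase", "fullwidth"])
width_first = apply_in_order(ligature, ["fullwidth", "uppercase"])

print(case_first == "\uff26\uff26", width_first == "FF")
```

With case mapping first, the ligature expands to "FF" and is then widened; with fullwidth first, nothing is widened and the later expansion stays in ASCII. For 'fullwidth lowercase' on the same ligature, both orders leave it unchanged in this model, since U+FB00 is already lowercase and has no fullwidth form.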

> My point here is simply that implementors need more detail than simple
> references to parts of Unicode.

I don't think these are part of a generic issue of how to refer to Unicode; rather, they are really great review feedback. I appreciate the effort and knowledge you put into giving us such great feedback.

> [1] Effect of text-transform on spaces
> http://lists.w3.org/Archives/Public/www-style/2011Feb/thread.html#msg470

[2] http://unicode.org/faq/casemap_charprop.html

Regards,
Koji

Received on Sunday, 8 May 2011 13:08:21 UTC