Re: [css-text] Enclosed alphanumerics and text-transform:capitalize from Jonathan Kew on 2015-03-23 (www-style@w3.org from March 2015)

From: Jonathan Kew <jfkthame@gmail.com>
Date: Mon, 23 Mar 2015 18:29:39 +0000
To: www-style@w3.org
Message-ID: <55105B93.40208@gmail.com>

On 18/3/15 03:15, fantasai wrote:
> On 03/10/2015 11:29 AM, Richard Ishida wrote:
>>
>> i was wondering about how to treat enclosed alphanumerics when
>> text-transform is set to capitalize.
>>
>> See the test results at
>> http://www.w3.org/International/tests/repo/results/text-transform
>>
>> wrt uppercase or lowercase transforms, the spec simply says "Puts all
>> letters in lowercase", or vice versa, and that seems to
>> me appropriate, for those characters that have Unicode mappings. The
>> tests text-transform-upperlower-026.html,
>> text-transform-upperlower-027.html indicate that this is what happens
>> across all major desktop browsers.
>>
>> For text-transform: capitalize, however, the spec says "Puts the first
>> *typographic letter unit* of each word in titlecase"
>> (my emphasis).  As you can see in test
>> text-transform-capitalize-031.html, it makes sense when punctuation
>> and the like
>> precede the actual word of the text to look for the first real letter.
>> (All browsers pass that test.)
>>
>> it's not clear to me, however, whether a word that only consists of
>> enclosed alphanumerics (which don't fit the definition of
>> 'typographic letter unit'), or even one that starts with an enclosed
>> alphanumeric block character, should not be titlecased:
>> see the results of text-transform-capitalize-026.html. Firefox
>> currently does not. Chrome and Safari, on the other hand do
>> titlecase per the Unicode data.  IE titlecases everything except the
>> first word on the page.
>>
>> i can't imagine that people will want to do this very often, so this
>> seems much like an edge case, but i thought i'd ask the
>> question, all the same.
>>
>> what's the answer?
>
> I think we should go with whatever the Unicode case mapping files
> define, and adjust the CSS spec wording to match.

Sorry to keep beating on this issue, but I'm not sure that really 
answers the question here. This isn't primarily about what's in the case 
mapping files -- which deal only with individual Unicode characters -- 
but about identifying the "typographic letter units" to which the case 
mapping should be applied.

The current CSS 'capitalize' transform is quite different in that regard 
from the toTitlecase(s) function[1] defined by Unicode, which as far as 
I can recall is the nearest parallel:

# R3    toTitlecase(X): Find the word boundaries in X according to
# Unicode Standard Annex #29, “Unicode Text Segmentation.” For each
# word boundary, find the first cased character F following the word
# boundary. If F exists, map F to Titlecase_Mapping(F); then map all
# characters C between F and the following word boundary to
# Lowercase_Mapping(C).

In general terms, the key issue here is that toTitlecase applies case 
mappings to all the letters of a word (Titlecase to the first, and 
Lowercase to the rest), whereas the CSS property applies Titlecase to 
the first letter and leaves the rest unchanged. Therefore, given content 
such as

   Ramsay MacDonald visits the USA

text-transform:capitalize will result in

   Ramsay MacDonald Visits The USA

whereas Unicode's toTitlecase() would give

   Ramsay Macdonald Visits The Usa

which I don't think is desirable.
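
To make the contrast concrete, here is a minimal Python sketch. It rests on several assumptions: Python's built-in str.title() stands in for Unicode's toTitlecase (it splits words at any non-letter rather than by UAX #29, which happens to agree for this sample), and css_style_capitalize is a hypothetical helper, not any engine's implementation. It treats whitespace as the only word separator and approximates a "typographic letter unit" as a character whose General Category starts with L or N.

```python
import unicodedata


def unicode_style_titlecase(text: str) -> str:
    # Stand-in for toTitlecase: titlecase the first letter of each word
    # and lowercase every following letter in that word.
    return text.title()


def css_style_capitalize(text: str) -> str:
    # Hypothetical model of text-transform:capitalize: titlecase only the
    # first letter-like character of each word, leave the rest untouched.
    out = []
    at_word_start = True
    for ch in text:
        if ch.isspace():
            at_word_start = True
            out.append(ch)
        elif at_word_start and unicodedata.category(ch)[0] in ("L", "N"):
            # Assumption: categories L* and N* approximate the spec's
            # "typographic letter unit"; punctuation and symbols are skipped.
            out.append(ch.title())
            at_word_start = False
        else:
            out.append(ch)
    return "".join(out)


sample = "Ramsay MacDonald visits the USA"
print(css_style_capitalize(sample))
print(unicode_style_titlecase(sample))
```

The first call preserves the interior capitals of "MacDonald" and "USA"; the second lowercases them, which is the difference described above.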

Given this difference in approach, I think we should continue to let CSS 
Text define exactly what text-transform:capitalize does -- in 
particular, which characters it affects -- rather than delegating this 
to Unicode.

As Richard points out, the current draft of CSS Text excludes the 
enclosed alphanumerics ⓐⓑⓒ etc. from its definition of "typographic 
letter units", and therefore they should also be excluded from the 
"words" that 'capitalize' affects. IMO, that's the most reasonable 
option: these characters are more symbol- or dingbat-like than 
letter-like, as reflected in their Unicode General Category of "So". So 
I'd like the WG to confirm that this is the correct interpretation of 
the spec.
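
The General Category claim can be checked against whatever Unicode data a given Python build ships with. The sketch below is only an illustration; has_case_mapping is a hypothetical helper that reports whether Python's upper() or lower() changes the character.

```python
import unicodedata


def has_case_mapping(ch: str) -> bool:
    # True if Python's Unicode case mappings change this character.
    return ch.upper() != ch or ch.lower() != ch


# U+24D0 CIRCLED LATIN SMALL LETTER A, U+24B6 CIRCLED LATIN CAPITAL LETTER A,
# U+1F150 NEGATIVE CIRCLED LATIN CAPITAL LETTER A, and plain "a" for comparison.
for ch in ["\u24d0", "\u24b6", "\U0001F150", "a"]:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), has_case_mapping(ch))
```

The enclosed characters report category "So" rather than a Letter category, yet only some of them carry case mappings.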

A further issue that I don't think has been mentioned here relates to 
the 'uppercase' and 'lowercase' transforms. ISTM that these transforms, 
too, should only affect "letters" (or "typographic letter units", as CSS 
Text likes to call them) and should leave Symbol characters untouched, 
even though some Symbol characters do have case mappings (by no means 
all of the "enclosed letter-based" ones do). The CSS Text draft is less 
clear about this, inasmuch as it fails to link the term "letters" in 
'uppercase' and 'lowercase' to a definition in the Terminology section 
(as earlier drafts did), but the only plausible interpretation I can see 
is that "letter" here is shorthand for "typographic letter unit", and so 
once again the Symbol characters are excluded.

AFAIK, all engines -- including Gecko, which gets 'capitalize' right by 
this interpretation -- currently mishandle this, and apply case mappings 
to Symbol characters. However, I doubt that changing our behavior to 
match the spec here is likely to "break the Web" in any substantial way, 
and it would put us in a more consistent and predictable state. (It 
would seem odd that 'text-transform:uppercase' affects ⓐⓑⓒ if 
'text-transform:capitalize' does not; or that 'text-transform:lowercase' 
affects ⒶⒷⒸ but not 🅐🅑🅒.)
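
As a rough sketch of the behavior proposed here, compared with a plain Unicode case mapping: letter_only_upper and letter_only_lower are hypothetical helpers, and the test for "letter" (General Category starting with L or N) is an assumption standing in for the spec's "typographic letter unit".

```python
import unicodedata


def is_letter_unit(ch: str) -> bool:
    # Assumption: approximate "typographic letter unit" by categories L* and N*.
    return unicodedata.category(ch)[0] in ("L", "N")


def letter_only_upper(text: str) -> str:
    # Uppercase letters only; Symbol characters pass through unchanged.
    return "".join(ch.upper() if is_letter_unit(ch) else ch for ch in text)


def letter_only_lower(text: str) -> str:
    # Lowercase letters only; Symbol characters pass through unchanged.
    return "".join(ch.lower() if is_letter_unit(ch) else ch for ch in text)


circled_small = "\u24d0\u24d1\u24d2"        # circled small a, b, c
circled_capital = "\u24b6\u24b7\u24b8"      # circled capital A, B, C
negative_circled = "\U0001F150\U0001F151\U0001F152"  # negative circled A, B, C

# Plain case mapping changes the circled letters...
print(("abc " + circled_small).upper())
# ...whereas the letter-only variant leaves them alone.
print(letter_only_upper("abc " + circled_small))
print(letter_only_lower("ABC " + circled_capital + " " + negative_circled))
```

With the letter-only helpers, all three enclosed series are treated the same way, which removes the inconsistency between the circled and negative circled forms.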

In summary, I think the CSS Text spec should maintain its definition of 
these transforms as applying only to letters, and should reinstate its 
link to the definition of "[typographic] letter [unit]" for 'uppercase' 
and 'lowercase' to reinforce this. An informative note could be added 
alerting implementers to the fact that some non-Letter characters have 
case mappings defined in Unicode, but should *not* be affected by these 
text-transform values.

JK


[1] http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf, page 154.

Received on Monday, 23 March 2015 18:30:08 UTC