RE: [css3-text] text-transform:capitalize (was New WD of CSS Text Level 3

> Are we suggesting language-specific changes to UAX#29?

No, I'm looking for an appropriate level of features for the CSS text-transform:capitalize property to support, and in that sense, I think UAX #29 is a good candidate to define the level.

I didn't know about the French case you raised; thank you for letting us know about it.

I originally thought this was an easy feature that all major browsers already supported, so we would just need to write the spec down. The discussion then revealed that the browsers are not interoperable today, and that doing it right for everyone and every case is pretty difficult.

May I ask what level you think CSS text-transform:capitalize should support?


Regards,
Koji

-----Original Message-----
From: bradyduga@gmail.com [mailto:bradyduga@gmail.com] On Behalf Of Brady Duga
Sent: Sunday, February 20, 2011 8:14 AM
To: Koji Ishii
Cc: Xaxio Brandish; John Cowan; W3C style mailing list; 'WWW International' (www-international@w3.org)
Subject: Re: [css3-text] text-transform:capitalize (was New WD of CSS Text Level 3

Are we suggesting language-specific changes to UAX#29? For instance, the proper French titlecase of "l'histoire de france pour les nuls" is "L'Histoire de France pour les Nuls", not "L'histoire de France pour les Nuls". Ignoring the fact that this would result in more caps than expected (de, pour and les), it seems like there is no way to get both French (l'histoire -> L'Histoire) and English (can't -> Can't) titlecasing without using language-specific word break tables.
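
The conflict can be illustrated with a small sketch. The helper below is hypothetical and deliberately naive: it is not UAX#29, it only treats a "word" as a run of letters and lets a flag decide whether an apostrophe separates words. Neither setting gives the expected result for both languages.

```python
import re

def capitalize(text, apostrophe_breaks_words):
    # Hypothetical helper for illustration only (not UAX#29 segmentation).
    letters = r"[^\W\d_]+"
    if apostrophe_breaks_words:
        pattern = letters
    else:
        # Letter runs joined by apostrophes count as a single word.
        pattern = letters + r"(?:'" + letters + r")*"
    # Uppercase the first character of each word, leave the rest as written.
    return re.sub(pattern, lambda m: m.group(0)[0].upper() + m.group(0)[1:], text)

for flag in (True, False):
    print(flag, capitalize("l'histoire de france", flag), "|", capitalize("can't stop", flag))
```

With the flag set, the French elision comes out as wanted (L'Histoire) but English becomes "Can'T"; with it cleared, English is right (Can't) but French yields "L'histoire".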

--Brady

On Feb 19, 2011, at 2:35 PM, Koji Ishii wrote:


John and Xaxio, thank you very much for steering this issue in the right direction.

It looks like this is the way to go:
1. Use UAX#29 Word Boundaries[1] to delimit words
2. Take the first letter or numeric character of each word and, if it's a letter, apply the Unicode titlecase mapping

There are two problems with this approach:
1. UAX#29 defines "a.a" as a word and therefore it doesn't solve the "a.m." case Xaxio raised.
2. No single browser uses this logic
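
The first problem can be seen with a toy segmenter. This is only a sketch: it imitates the effect of UAX#29 rules WB6 and WB7 (no break around a MidLetter or MidNumLet character that sits between two letters) with a regular expression, and MID_NUM_LET is an assumed, reduced stand-in for the real property, not the full character list.

```python
import re

# Assumed, reduced stand-in for the UAX#29 MidNumLet property.
MID_NUM_LET = ".'\u2019\uFF0E"

LETTER = r"[^\W\d_]"
# Letter runs may be joined by one MidNumLet character, but only when
# another letter follows (a rough imitation of rules WB6 and WB7).
WORD = re.compile(rf"{LETTER}+(?:[{re.escape(MID_NUM_LET)}]{LETTER}+)*")

def words(text):
    # Return the letter-based words found in text.
    return WORD.findall(text)

print(words("9 a.m. sharp"))
```

Because "a.m" is reported as a single word, capitalizing only its first letter would give "A.m." rather than "A.M.".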

If we modify UAX#29 so that "." U+002E FULL STOP and U+FF0E FULLWIDTH FULL STOP delimit words, Safari and Chrome seem to come very close. I tested all the punctuation listed in UAX#29, and these two are the only exceptions (I haven't tested whether all other punctuation delimits words, though).

So here's the modified proposal:
1. Exclude U+002E and U+FF0E from MidNumLet in UAX#29 Word Boundaries and use the resulting rules to delimit words.
2. Take the first letter or numeric character of each word (skipping punctuation) and, if it's a letter, apply the Unicode titlecase mapping.
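
A minimal sketch of the modified proposal follows. It is an approximation under stated assumptions, not an implementation of UAX#29: the regular expression stands in for the word boundary rules, JOINERS is an assumed subset of what remains in MidNumLet once the two full stops are removed, and the name capitalize is made up for the example.

```python
import re

LETTER_OR_DIGIT = r"[^\W_]"
# Assumed joiners left after removing U+002E and U+FF0E from MidNumLet.
JOINERS = "'\u2019"
# Every match starts at a letter or digit, so leading punctuation is skipped.
WORD = re.compile(rf"{LETTER_OR_DIGIT}+(?:[{JOINERS}]{LETTER_OR_DIGIT}+)*")

def capitalize(text):
    def fix(match):
        word = match.group(0)
        first = word[0]
        # Digits are left alone; a letter gets the Unicode titlecase mapping.
        return (first.title() if first.isalpha() else first) + word[1:]
    return WORD.sub(fix, text)

print(capitalize("9 a.m. sharp"))
print(capitalize("can't"))
print(capitalize("3d model"))
```

In this sketch "a.m." becomes "A.M." because the full stop now separates words, "can't" stays one word, and "3d" is untouched because its first character is a digit.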

I don't think we need to worry about "O'Donnell", as it's unlikely that someone would write this as "o'donnell" and apply titlecase to it.

IE9 seems to make "." one of its exceptions for delimiting words, which makes me worry that there may be counterexamples to "a.m."; i.e., cases where "." should not delimit words. Does anyone have any idea?

[1] http://www.unicode.org/reports/tr29/#Word_Boundaries


Regards,
Koji

-----Original Message-----
From: Xaxio Brandish [mailto:xaxiobrandish@gmail.com] 
Sent: Sunday, February 20, 2011 6:28 AM
To: John Cowan
Cc: Koji Ishii; W3C style mailing list; 'WWW International' (www-international@w3.org)
Subject: Re: [css3-text] text-transform:capitalize (was New WD of CSS Text Level 3

John,

I was thinking about commenting on this as well, but I hesitated due to the characters in Japanese not being technically "letters".  I'm glad that you said something, because at least I was thinking along the right lines.  I also hesitated because I was wondering if "word" in the description covers only letters and already excludes punctuation.

Perhaps "word" should be defined as "characters excluding punctuation and whitespace".  In Firefox and Chrome tests, numbers directly before letters keep the letters from receiving capitalization when using this property.

Also, what about names like O'Donnell?  Are these kinds of cases undetectable for the purpose of applying this property...?

--Xaxio
On Sat, Feb 19, 2011 at 1:16 PM, John Cowan <cowan@mercury.ccil.org> wrote:
Koji Ishii scripsit:


> Transforms the first character in each word to uppercase; all other
> characters remain unaffected; i.e., they're not transformed to
> lowercase, but will appear as written in the document.

It seems to me that it is better to speak of the "first letter with case".
For example, "'tis" (short for "it is") titlecases to "'Tis", not "'tis".
Similarly, the word "!Kung" (the name of a southern African people) is
correctly so capitalized whether the "!" is the punctuation mark or
the identical-looking U+01C3, a caseless letter.  (The Dutch words 't,
's, and 'n never get capitalized, but we can't have everything.)

Furthermore, the Croatian double letters dj, lj, nj, and dz-with-caron
must be correctly titlecased to Dj, Lj, Nj, and Dz-with-caron, whether
they are represented with one character or two.  Unicode already provides
a titlecase mapping that handles these and other two-letter characters.
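
The difference between uppercase and titlecase for these characters can be checked in Python, whose str.title() applies the Unicode titlecase mapping to the first cased character. The helper name titlecase_first is an assumption made for this illustration.

```python
import unicodedata

def titlecase_first(word):
    # Titlecase only the first character; leave the rest as written.
    return word[:1].title() + word[1:]

dz_caron = "\u01C6"  # LATIN SMALL LETTER DZ WITH CARON, a single code point

# Uppercase and titlecase are different code points for this character.
print(unicodedata.name(dz_caron.upper()))
print(unicodedata.name(dz_caron.title()))

# One-character and two-character spellings both get a capital first letter only.
print(titlecase_first("\u01C9ubav"), titlecase_first("ljubav"))
```

Uppercasing the first character instead would turn the single-character digraph into its all-capitals form, which is why the titlecase mapping is the right tool here.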

--
It was impossible to inveigle           John Cowan <cowan@ccil.org>
Georg Wilhelm Friedrich Hegel           http://www.ccil.org/~cowan
Into offering the slightest apology
For his Phenomenology.                      --W. H. Auden, from "People" (1953)

Received on Sunday, 20 February 2011 12:56:00 UTC