
Re: [css3-text] text-transform:capitalize (was New WD of CSS Text Level 3

From: Brady Duga <duga@ljug.com>
Date: Sat, 19 Feb 2011 15:14:18 -0800
Message-ID: <AANLkTi=LEebviu9UCMZxQr+sTbJXnf944doBd_Qo7Guu@mail.gmail.com>
To: Koji Ishii <kojiishi@gluesoft.co.jp>
Cc: Xaxio Brandish <xaxiobrandish@gmail.com>, John Cowan <cowan@mercury.ccil.org>, W3C style mailing list <www-style@w3.org>, "'WWW International' (www-international@w3.org)" <www-international@w3.org>
Are we suggesting language-specific changes to UAX#29? For instance, the
proper French titlecase of "l'histoire de france pour les nuls" is
"L'Histoire de France pour les Nuls", not "L'histoire de France pour les
Nuls". Ignoring the fact that this would result in more caps than expected
(de, pour and les), it seems like there is no way to get both French
(l'histoire -> L'Histoire) and English (can't -> Can't) titlecasing without
using language-specific word break tables.
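
To make the conflict concrete, here is a rough Python sketch. The WORD_JOINERS
table and the capitalize() helper are hypothetical names invented for this
illustration; the per-language entries are assumptions, not data from UAX#29
or CLDR, and the regex is only a crude stand-in for real word breaking.

```python
import re

# Hypothetical per-language tables: characters that join two letter runs
# into a single word. Illustrative only, not taken from UAX#29 or CLDR.
WORD_JOINERS = {
    "en": "'\u2019",  # apostrophe keeps "can't" together
    "fr": "",         # apostrophe splits "l'histoire" into "l" + "histoire"
}

def capitalize(text: str, lang: str) -> str:
    """Titlecase the first letter of each word; leave the rest as written."""
    joiners = re.escape(WORD_JOINERS.get(lang, ""))
    letters = r"[^\W\d_]+"  # a run of letters
    pattern = letters + (rf"(?:[{joiners}]{letters})*" if joiners else "")
    return re.sub(pattern, lambda m: m.group()[:1].title() + m.group()[1:], text)

print(capitalize("can't", "en"))
print(capitalize("l'histoire de france pour les nuls", "fr"))
print(capitalize("l'histoire", "en"))
print(capitalize("can't", "fr"))
```

With a single table, one of the two languages always loses: the English table
leaves the "h" of "l'histoire" lowercase, and the French table capitalizes
the "t" of "can't". The French run also shows the extra caps on de, pour and
les.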

--Brady

On Feb 19, 2011, at 2:35 PM, Koji Ishii wrote:

John and Xaxio, thank you very much for steering this issue in the right
direction.

It looks like this is the way to go:
1. Use UAX#29 Word Boundaries[1] to delimit words
2. Take the first letter or numeric character of each word and, if it is a
letter, apply the Unicode titlecase mapping

There are two problems with this approach:
1. UAX#29 defines "a.a" as a word and therefore it doesn't solve the "a.m."
case Xaxio raised.
2. No browser currently uses this logic

If we modify UAX#29 so that "." U+002E FULL STOP and U+FF0E FULLWIDTH FULL
STOP delimit words, Safari and Chrome come very close to this logic. I tested
all the punctuation listed in UAX#29, and these two are the only exceptions
(I haven't tested whether all other punctuation delimits words, though).

So here's the modified proposal:
1. Exclude U+002E and U+FF0E from MidNumLet in UAX#29 Word Boundaries and
use it to delimit words.
2. Take the first letter-or-numeric of each word (skipping punctuation) and,
if it's a letter, apply the Unicode titlecase mapping.
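
A rough Python sketch of the difference between the two proposals follows. It
is not a UAX#29 implementation: the regex is a simplified stand-in for the
word boundary rules, the two character sets are only a small illustrative
subset of MidNumLet, and the names MID_NUM_LET, MODIFIED and capitalize() are
made up for this example.

```python
import re

# Simplified stand-in for UAX#29: a word is a run of letters/digits, optionally
# continued across a single "mid" character that sits between two such runs.
# Only a small, illustrative subset of MidNumLet is modeled here.
MID_NUM_LET = "'\u2019.\uff0e"  # includes U+002E and U+FF0E
MODIFIED = "'\u2019"            # U+002E and U+FF0E excluded

def capitalize(text: str, mid_chars: str) -> str:
    run = r"[^\W_]+"
    word = re.compile(rf"{run}(?:[{re.escape(mid_chars)}]{run})*")

    def fix(m: re.Match) -> str:
        w = m.group()
        first = w[0]  # always a letter or digit, never punctuation
        # Titlecase only when the first letter-or-numeric is a letter.
        return first.title() + w[1:] if first.isalpha() else w

    return word.sub(fix, text)

print(capitalize("9 a.m.", MID_NUM_LET))  # "a.m" is treated as one word
print(capitalize("9 a.m.", MODIFIED))     # "a" and "m" are separate words
print(capitalize("3d glasses", MODIFIED)) # leading digit blocks capitalization
```

Under these assumptions the unmodified set yields "A.m." and the modified one
yields "A.M.", while "can't" stays one word in both and "o'donnell" becomes
"O'donnell" in both.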

I don't think we need to worry about "O'Donnell", as it's unlikely that
someone writes this as "o'donnell" and applies titlecase to it.

IE9 seems to treat "." as one of its exceptions when delimiting words, which
makes me worry that there may be counterexamples to "a.m."; i.e., cases where
"." should not delimit words. Does anyone know of any?

[1] http://www.unicode.org/reports/tr29/#Word_Boundaries


Regards,
Koji

-----Original Message-----
From: Xaxio Brandish [mailto:xaxiobrandish@gmail.com]
Sent: Sunday, February 20, 2011 6:28 AM
To: John Cowan
Cc: Koji Ishii; W3C style mailing list; 'WWW International' (
www-international@w3.org)
Subject: Re: [css3-text] text-transform:capitalize (was New WD of CSS Text
Level 3

John,

I was thinking about commenting on this as well, but I hesitated due to the
characters in Japanese not being technically "letters".  I'm glad that you
said something, because at least I was thinking along the right lines.  I
also hesitated because I was wondering if "word" in the description covers
only letters and already excludes punctuation.

Perhaps "word" should be defined as "characters excluding punctuation and
whitespace".  In my tests with Firefox and Chrome, digits placed directly
before letters keep those letters from being capitalized by this property.

Also, what about names like O'Donnell?  Are these kinds of cases
undetectable for the purpose of applying this property...?

--Xaxio
On Sat, Feb 19, 2011 at 1:16 PM, John Cowan <cowan@mercury.ccil.org> wrote:
Koji Ishii scripsit:

Transforms the first character in each word to uppercase; all other
characters remain unaffected; i.e., they're not transformed to
lowercase, but will appear as written in the document.

It seems to me that it is better to speak of the "first letter with case".
For example, "'tis" (short for "it is") titlecases to "'Tis", not "'tis".
Similarly, the word "!Kung" (the name of a South African people) is
correctly so capitalized whether the "!" is the punctuation mark or
the identical-looking U+01C3, a caseless letter.  (The Dutch words 't,
's, and 'n never get capitalized, but we can't have everything.)
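
As a rough illustration of the "first letter with case" rule, here is a small
Python sketch. The helper names are invented for this example, and is_cased()
only approximates the Unicode Cased property by checking whether a case
mapping changes the character.

```python
def is_cased(ch: str) -> bool:
    # Approximation of "has case": lowercasing or uppercasing changes it.
    return ch.lower() != ch or ch.upper() != ch

def capitalize_word(word: str) -> str:
    """Titlecase the first cased character; leave everything else as written."""
    for i, ch in enumerate(word):
        if is_cased(ch):
            return word[:i] + ch.title() + word[i + 1:]
    return word  # no cased character at all

print(capitalize_word("'tis"))
print(capitalize_word("!kung"))        # ASCII exclamation mark
print(capitalize_word("\u01c3kung"))   # U+01C3, a caseless letter
```

Both spellings of "!kung" come out with a capital K because neither "!" nor
U+01C3 has a case mapping. The Dutch 't is not special-cased here, so it would
become 'T.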

Furthermore, the Croatian double letters dj, lj, nj, and dz-with-caron
must be correctly titlecased to Dj, Lj, Nj, and Dz-with-caron, whether
they are represented with one character or two.  Unicode already provides
a titlecase mapping that handles these and other two-letter characters.
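
For illustration, Python's str.title() applies the Unicode titlecase mapping,
so the digraph case can be sketched as below. The titlecase_first() helper and
the sample string are assumptions made for this example only.

```python
def titlecase_first(word: str) -> str:
    """Apply the Unicode titlecase mapping to the first character only."""
    return word[:1].title() + word[1:]

DZ_SMALL = "\u01c6"  # LATIN SMALL LETTER DZ WITH CARON
DZ_TITLE = "\u01c5"  # LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
DZ_UPPER = "\u01c4"  # LATIN CAPITAL LETTER DZ WITH CARON

# Single-character digraph: titlecase differs from uppercase.
print(titlecase_first(DZ_SMALL + "ep") == DZ_TITLE + "ep")
print(DZ_SMALL.upper() == DZ_UPPER)

# Two-character spelling (d + z-with-caron): only the "d" is changed.
print(titlecase_first("d\u017eep") == "D\u017eep")
```

Either representation ends up with a capital D followed by a small
z-with-caron, which a plain uppercase mapping of the single-character form
would not give.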

--
It was impossible to inveigle           John Cowan <cowan@ccil.org>
Georg Wilhelm Friedrich Hegel           http://www.ccil.org/~cowan
Into offering the slightest apology
For his Phenomenology.                      --W. H. Auden, from "People"
(1953)
Received on Saturday, 19 February 2011 23:17:06 GMT
