- From: Brady Duga <duga@ljug.com>
- Date: Sat, 19 Feb 2011 15:14:18 -0800
- To: Koji Ishii <kojiishi@gluesoft.co.jp>
- Cc: Xaxio Brandish <xaxiobrandish@gmail.com>, John Cowan <cowan@mercury.ccil.org>, W3C style mailing list <www-style@w3.org>, "'WWW International' (www-international@w3.org)" <www-international@w3.org>
- Message-ID: <AANLkTi=LEebviu9UCMZxQr+sTbJXnf944doBd_Qo7Guu@mail.gmail.com>
Are we suggesting language-specific changes to UAX#29? For instance, the proper French titlecase of "l'histoire de france pour les nuls" is "L'Histoire de France pour les Nuls", not "L'histoire de France pour les Nuls". Ignoring the fact that this would result in more caps than expected (de, pour and les), it seems like there is no way to get both French (l'histoire -> L'Histoire) and English (can't -> Can't) titlecasing without using language-specific word break tables.

--Brady

On Feb 19, 2011, at 2:35 PM, Koji Ishii wrote:

John and Xaxio, thank you a lot for leading this issue in the right direction.

It looks like this is the way to go:
1. Use UAX#29 Word Boundaries[1] to delimit words
2. Take the first letter or numeric of each word and, if it's a letter, use Unicode titlecase

There are two problems with this approach:
1. UAX#29 defines "a.a" as a word and therefore it doesn't solve the "a.m." case Xaxio raised.
2. No single browser uses this logic

If we modify UAX#29 to delimit words by "." U+002E and U+FF0E FULLWIDTH FULL STOP, Safari and Chrome seem to be very close. I tested all punctuation listed in UAX#29 and the two are the only exceptions (I haven't tested whether all other punctuation delimits words, though.)

So here's the modified proposal:
1. Exclude U+002E and U+FF0E from MidNumLet in UAX#29 Word Boundaries and use it to delimit words.
2. Take the first letter-or-numeric of each word (skip punctuation) and, if it's a letter, use Unicode titlecase.

I don't think we need to worry about "O'Donnell" as it's unlikely that someone writes this as "o'donnell" and applies titlecase to it.

IE9 seems to treat "." as one of the exceptions when delimiting words, and that worries me that there may be counter cases to "a.m."; i.e., cases where "." should not delimit words. Does anyone have any idea?

[1] http://www.unicode.org/reports/tr29/#Word_Boundaries

Regards,
Koji

-----Original Message-----
From: Xaxio Brandish [mailto:xaxiobrandish@gmail.com]
Sent: Sunday, February 20, 2011 6:28 AM
To: John Cowan
Cc: Koji Ishii; W3C style mailing list; 'WWW International' (www-international@w3.org)
Subject: Re: [css3-text] text-transform:capitalize (was New WD of CSS Text Level 3

John,

I was thinking about commenting on this as well, but I hesitated due to the characters in Japanese not being technically "letters". I'm glad that you said something, because at least I was thinking along the right lines.

I also hesitated because I was wondering if "word" in the description covers only letters and already excludes punctuation. Perhaps "word" should be defined as "characters excluding punctuation and whitespace". In Firefox and Chrome tests, numbers directly before letters keep the letters from receiving capitalization when using this property.

Also, what about names like O'Donnell? Are these kinds of cases undetectable for the purpose of applying this property...?

--Xaxio

On Sat, Feb 19, 2011 at 1:16 PM, John Cowan <cowan@mercury.ccil.org> wrote:

Koji Ishii scripsit:

Transforms the first character in each word to uppercase; all other characters remain unaffected; i.e., they're not transformed to lowercase, but will appear as written in the document.

It seems to me that it is better to speak of the "first letter with case". For example, "'tis" (short for "it is") titlecases to "'Tis", not "'tis". Similarly, the word "!Kung" (the name of a South African people) is correctly so capitalized whether the "!" is the punctuation mark or the identical-looking U+01C3, a caseless letter. (The Dutch words 't, 's, and 'n never get capitalized, but we can't have everything.)

Furthermore, the Croatian double letters dj, lj, nj, and dz-with-caron must be correctly titlecased to Dj, Lj, Nj, and Dz-with-caron, whether they are represented with one character or two.
Unicode already provides a titlecase mapping that handles these and other two-letter characters.

--
It was impossible to inveigle           John Cowan <cowan@ccil.org>
Georg Wilhelm Friedrich Hegel           http://www.ccil.org/~cowan
Into offering the slightest apology
For his Phenomenology.                  --W. H. Auden, from "People" (1953)
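
For readers following the thread, the modified proposal (treat "." as a word delimiter, then titlecase the first letter-or-numeric of each word only if it is a letter) can be roughly sketched in Python. This is an illustration under loud assumptions, not anyone's implementation: the regular expression is a simplified stand-in for UAX#29 word boundaries (letters and digits, optionally joined by an apostrophe), and the names `_WORD` and `capitalize` are hypothetical. Python's `str.title()` on a single character applies the Unicode titlecase mapping, which covers the one-character digraph case John mentions.

```python
import re

# ASSUMPTION: a simplified stand-in for UAX#29 word boundaries, not the real
# algorithm. A "word" is a run of letters/digits, optionally joined by an
# apostrophe (U+0027 or U+2019). "." is deliberately not a joiner, so it
# delimits words as in the modified proposal.
_WORD = re.compile(r"[^\W_]+(?:['\u2019][^\W_]+)*")


def capitalize(text: str) -> str:
    """Titlecase the first character of each word if it is a letter.

    All other characters are left exactly as written.
    """
    def fix(match: re.Match) -> str:
        word = match.group(0)
        first = word[0]
        if not first.isalpha():
            # First letter-or-numeric is a digit: leave the word untouched.
            return word
        # str.title() on one character uses the Unicode titlecase mapping.
        return first.title() + word[1:]

    return _WORD.sub(fix, text)


print(capitalize("9 a.m. sharp"))   # 9 A.M. Sharp
print(capitalize("'tis can't"))     # 'Tis Can't
print(capitalize("l'histoire"))     # L'histoire
print(capitalize("3d view"))        # 3d View
```

Under these assumptions the sketch handles "a.m.", "'tis" and "can't", but it yields "L'histoire" rather than "L'Histoire", which is the language-specific gap Brady points out.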
Received on Saturday, 19 February 2011 23:17:06 UTC