RE: Titlecasing words starting with numeric glyphs and period as word separator from Phillips, Addison on 2011-02-23 (public-i18n-core@w3.org from January to March 2011)

From: Phillips, Addison <addison@lab126.com>
Date: Wed, 23 Feb 2011 08:51:55 -0800
To: Mark Davis ☕ <mark@macchiato.com>, Koji Ishii <kojiishi@gluesoft.co.jp>
CC: "ishida@w3.org" <ishida@w3.org>, "public-i18n-core@w3.org" <public-i18n-core@w3.org>
Message-ID: <C7A5719F1E562149BA9171F58BEE2CA412CBB6D893@EX-IAD6-B.ant.amazon.com>
Hello Mark,

We discussed this topic today in the Internationalization Core WG (Koji is a member of our WG).

To summarize this topic, basically CSS text-transform “capitalize” exists as a sort of legacy text transformation. Its shortcomings are reasonably well-known and the I18N WG’s current sense of this issue is that we will work with CSS to document the limitations of using it for titlecasing, as well as a few useful cases. The I18N WG decided not to formally seek additional work by the UTC or CLDR-TC because CSS currently intends merely to document the current implementation behavior more clearly, referring to existing text in UAX#29 but not mandating any particular language-aware tailoring. There is recognition that this feature is of only marginal utility and/or outright destructive for some languages and scripts and that a general purpose solution that is widely interoperable would be difficult to achieve.

It should be noted, however, that we would welcome work by Unicode, since it would be useful for many applications if language-specific tailorings to support word selection/segmentation and to support e.g. titlecase processing were gathered and formalized. That is, we recognize that it may be important, but it is not urgent.

Koji: if at some point you or the CSS WG would like to make further requests of the UTC or CLDR-TC to support ongoing work, please don’t hesitate to contact either Richard or me. We’d be happy to use the W3C liaison relationship to facilitate your needs.

Addison

Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N, IETF IRI WGs)

Internationalization is not a feature.
It is an architecture.


From: mark.edward.davis@gmail.com [mailto:mark.edward.davis@gmail.com] On Behalf Of Mark Davis ☕
Sent: Wednesday, February 23, 2011 7:58 AM
To: Koji Ishii; ishida@w3.org; Phillips, Addison
Cc: unicode@unicode.org
Subject: Re: Titlecasing words starting with numeric glyphs and period as word separator

I didn't take what you said as at all brash - you and others at CSS are looking for a solution to your issue, and there is no reason for you to know the structure and process used in the Unicode Consortium. Such a solution could involve use of structure and properties already defined (by the UTC and CLDR-TC), or result in improvements or extensions to those structures.

I should have also mentioned that the W3C has a liaison relationship with the Unicode Consortium, and you can also work through knowledgeable people in the i18n group in the W3C, such as Richard Ishida and Addison Phillips.

Mark

— Il meglio è l’inimico del bene —

On Tue, Feb 22, 2011 at 20:14, Koji Ishii <kojiishi@gluesoft.co.jp> wrote:
Thank you Mark for leading me.

I apologize for any brashness, as I’m new here.

I didn’t state what I want very clearly, and I’m sorry about that. All I want for now is to present what was discussed at CSS, listen to what people here have to say, and hopefully have some discussion.

I’m not sure at this point whether I want it to be on the next agenda, but I’ll follow your instructions if I do.


Regards,
Koji

From: mark.edward.davis@gmail.com [mailto:mark.edward.davis@gmail.com] On Behalf Of Mark Davis ☕
Sent: Tuesday, February 22, 2011 4:56 PM
To: Koji Ishii
Cc: unicode@unicode.org
Subject: Re: Titlecasing words starting with numeric glyphs and period as word separator

The default Unicode rules cannot cover all languages or circumstances properly. It is worth bringing up to the Unicode technical committee any proposals (and/or problem cases) with the default rules, but bear in mind that those default rules will never be able to cover all languages well. Acronyms, hyphenations, and contractions present particular problems: there are some notes on some of them in http://www.unicode.org/reports/tr29/.


You can have discussions here or on the http://unicode.org/forum/, but to get on the next agenda (May) for the UTC, make sure that there is a proposal filed by a member or by you on http://www.unicode.org/reporting.html.


> "word separating rules optimized for titlecasing" could be slightly different from general word separating rules

Language-specific rules, such as those for titlecasing, fall under the CLDR technical committee (http://cldr.unicode.org/). Tickets for adding structure and data for language-specific titlecasing were filed some time ago, but the work hadn't reached a high enough relative priority for the committee to take it on. Having such "word separating rules optimized for titlecasing" was the direction the committee was thinking of. I put it on the agenda for the next CLDR meeting (that committee meets weekly by phone), and you can file a ticket with additional information and/or example problem cases that you'd like to see handled: http://unicode.org/cldr/trac/newticket


Mark

— Il meglio è l’inimico del bene —
On Mon, Feb 21, 2011 at 23:15, Koji Ishii <kojiishi@gluesoft.co.jp> wrote:
Hello,

There's a discussion going on in the W3C CSS mailing list[1] about the specification of the text-transform property[2], specifically how the "capitalize" value titlecases a specified span of text.

During the discussion, two cases were presented:

1. Titlecasing a word that starts with numeric glyphs (e.g., "99ers") can produce "99Ers" if we follow the rules defined in 5.18 Case Mappings. Has this been discussed here? Is it up to implementations to decide which words titlecasing applies to, or should this be fixed in the Unicode spec?

2. We're thinking of using UAX #29 to separate words and then applying Titlecase_Mapping to every word. But doing so turns "a.m." into "A.m.", which contradicts general publication rules[3]. I understand that both word separation and titlecasing are ambiguous and cannot be perfect, and that we must make compromises. But since Unicode defines these two rules separately, I guess there's a possibility that "word separating rules optimized for titlecasing" could be slightly different from general word separating rules. I haven't thought much about counter-cases, but I wonder whether anyone on this list has thoughts on whether we should do this, or whether other cases should be included as well.
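
To make the two cases concrete, here is a minimal Python sketch. It is not what any browser or the CSS draft does. The regular expression is only a rough stand-in for default word segmentation (the UAX #29 rules linked elsewhere in this thread), and the names WORD, is_cased, titlecase_word and capitalize are hypothetical helpers made up for this illustration. Each word is handled in the spirit of the default titlecasing operation: the first cased character is titlecased and the rest of the word is lowercased.

```python
import re

# ASSUMPTION: a rough stand-in for default word segmentation. A word is a run
# of letters/digits, optionally continued across a single "." or apostrophe
# that sits between two such runs. The real UAX #29 rules are more detailed.
WORD = re.compile(r"[^\W_]+(?:[.'\u2019][^\W_]+)*")


def is_cased(ch: str) -> bool:
    # Crude test: a character counts as cased if case mapping changes it.
    return ch.lower() != ch.upper()


def titlecase_word(word: str) -> str:
    # Titlecase the first cased character and lowercase what follows it;
    # anything before it (such as leading digits) is left untouched.
    for i, ch in enumerate(word):
        if is_cased(ch):
            return word[:i] + ch.title() + word[i + 1:].lower()
    return word  # no cased character at all, e.g. "2011"


def capitalize(text: str) -> str:
    # Apply the per-word titlecasing to every word found in the text.
    return WORD.sub(lambda m: titlecase_word(m.group()), text)


print(capitalize("99ers"))      # 99Ers
print(capitalize("at 9 a.m."))  # At 9 A.m.
```

Because the segmentation keeps "a.m" together as one word, only the "a" is titlecased, and because digits are not cased, the search for the first cased character skips past "99" and lands on the "e". Both results fall out of combining the two default rules without any tailoring.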

Any feedback is greatly appreciated.


Regards,
Koji

[1] http://lists.w3.org/Archives/Public/www-style/2011Feb/0621.html

[2] http://dev.w3.org/csswg/css3-text/#text-transform

[3] http://www.businesswritingblog.com/business_writing/2009/06/what-is-the-correct-time-am-pm-am-pm-am-pm-.html
Received on Wednesday, 23 February 2011 16:52:28 UTC