
RE: [css3-text] Thai line breaking rules

From: Koji Ishii <kojiishi@gluesoft.co.jp>
Date: Thu, 10 Feb 2011 01:00:46 -0500
To: Mark Davis ☕ <mark@macchiato.com>
CC: "www-style@w3.org" <www-style@w3.org>, "'WWW International' (www-international@w3.org)" <www-international@w3.org>
Message-ID: <A592E245B36A8949BDB0A302B375FB4E0AAF13A7AE@MAILR001.mail.lan>
Mark, thank you for your comment.

Minegishi-san is too busy to discuss this further for a while, so I expect this will take some time. In the meantime, fantasai and I talked about this. CSS3 Text[1] already refers to JIS4051, ZHMARK, and UAX14, so we think it would be ideal if UAX14 could cover the requirements.

UAX14 mentions that scripts such as Thai are beyond the scope of the Unicode Standard[2]. I agree with that, but we’d like to explore whether we can find rules that provide minimum-level, or at least better-than-nothing, line breaking in the absence of a dictionary, since browser vendors tend to want the download size to be as small as possible.

Once Minegishi-san has time to get back to the discussion, we’ll discuss whether his rules are about grapheme clusters or are line-break-specific issues, and then get back to the URL you gave me. I expect that will give us an idea of whether we want to fix UAX29 or add to UAX14.

I haven’t contributed anything to Unicode before, so your support for this would be greatly appreciated.

[1] http://dev.w3.org/csswg/css3-text/#line-breaking

[2] http://www.unicode.org/reports/tr14/tr14-17.html#BreakOpportunities



Regards,
Koji

From: mark.edward.davis@gmail.com [mailto:mark.edward.davis@gmail.com] On Behalf Of Mark Davis ☕
Sent: Tuesday, February 08, 2011 2:46 AM
To: Koji Ishii
Cc: www-style@w3.org; 'WWW International' (www-international@w3.org)
Subject: Re: [css3-text] Thai line breaking rules

Please also file any feedback you have on breaking conditions (aka boundaries, segmentation) for particular languages at

http://unicode.org/cldr/trac/newticket


Please specify whether it is word-break, line-break, or other types of breaks.

Mark

— Il meglio è l’inimico del bene —

On Mon, Feb 7, 2011 at 01:06, Koji Ishii <kojiishi@gluesoft.co.jp> wrote:
I removed the rules that are covered by extended grapheme clusters as defined in UAX #29[1].

I'll ask Minegishi-san at ILCAA to review this, but any feedback is also appreciated.

DO NOT BREAK BEFORE:
* U+0E2F
* U+0E5A

DO NOT BREAK BETWEEN:
* [U+0E31, U+0E3A] and <Consonants>
* U+0E3F THAI CURRENCY SYMBOL BAHT and digits
* Digits ([U+0E50-0E59] and [U+0E50-0E59])

Covered by extended grapheme clusters in UAX #29:
* <Consonants> and [U+0E30-0E3A]
* [U+0E40-0E44] and <Consonants>
* [U+0E24, U+0E26] and U+0E45
* Any and U+0E46 (category=Lm)
* <Consonants> and [U+0E47]
* (<Consonants> or [U+0E34-0E39]) and [U+0E48-0E4B]
* (<Consonants> or [U+0E34-0E39]) and U+0E4C
* <Consonants> and [U+0E4D-0E4E]
[1] http://unicode.org/reports/tr29/
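
As a rough illustration only, the explicit rules above (DO NOT BREAK BEFORE / DO NOT BREAK BETWEEN) could be sketched in Python as below. The names `break_prohibited` and `prohibited_positions` are hypothetical, and treating "digits" as both Thai digits and ASCII 0-9 is an assumption that the list does not state. The rules left to UAX #29 are not implemented here, and the sketch only says where a break is not allowed; it does not find word boundaries.

```python
# Hypothetical sketch of the explicit rules listed above (not part of any spec).
CONSONANTS = range(0x0E01, 0x0E2F)    # U+0E01-0E2E
THAI_DIGITS = range(0x0E50, 0x0E5A)   # U+0E50-0E59
ASCII_DIGITS = range(0x30, 0x3A)      # assumption: "digits" also covers 0-9
NO_BREAK_BEFORE = {0x0E2F, 0x0E5A}    # DO NOT BREAK BEFORE
RULE_MARKS = {0x0E31, 0x0E3A}         # must stay with a following consonant
BAHT = 0x0E3F


def break_prohibited(before: str, after: str) -> bool:
    """Return True if the rules above forbid a break between two characters."""
    b, a = ord(before), ord(after)
    if a in NO_BREAK_BEFORE:
        return True
    if b in RULE_MARKS and a in CONSONANTS:
        return True
    if b == BAHT and (a in THAI_DIGITS or a in ASCII_DIGITS):
        return True
    if b in THAI_DIGITS and a in THAI_DIGITS:
        return True
    return False


def prohibited_positions(text: str) -> list[int]:
    """Indexes i where a break between text[i-1] and text[i] is forbidden."""
    return [i for i in range(1, len(text))
            if break_prohibited(text[i - 1], text[i])]


if __name__ == "__main__":
    # BAHT sign followed by two Thai digits: no break inside the amount.
    print(prohibited_positions("\u0e3f\u0e51\u0e52"))
```

A caller would treat every other position as a candidate break opportunity, after first applying grapheme cluster boundaries from UAX #29 by some other means (the Python standard library does not provide them).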


Regards,
Koji

-----Original Message-----
From: www-style-request@w3.org [mailto:www-style-request@w3.org] On Behalf Of Koji Ishii
Sent: Monday, February 07, 2011 12:47 PM
To: www-style@w3.org
Subject: [css3-text] Thai line breaking rules
I had a meeting with ILCAA, the Research Institute for Languages and Cultures of Asia and Africa[1], in Tokyo. Minegishi-san at ILCAA presented his idea for the issue currently mentioned in the CSS3 Text spec[2]:

> Additionally, some guidance should be provided on how to break or not
> break Southeast Asian in the absence of a dictionary.

Here's his draft of simple line-breaking rules for Thai script in the absence of a dictionary. Any corrections, or opinions on whether to include this in the spec, would be appreciated.

Thai character groups are based on TIS 620-2533, as written in the Unicode spec[3].
 Consonants: U+0E01-0E2E

Line breaks are prohibited between:
* Any and U+0E2F
* <Consonants> and [U+0E30-0E3A]
* [U+0E31, U+0E3A, U+0E40-0E44] and <Consonants>
* U+0E3F THAI CURRENCY SYMBOL BAHT and digits
* [U+0E24, U+0E26] and U+0E45
* [U+0E50-0E59] and [U+0E50-0E59]
* Any and U+0E5A

The following rules were also presented, but the characters involved are in Unicode category Lm or Mn, so I suspect that UAX#29 Unicode Text Segmentation[4] should cover them.
* Any and U+0E46
* <Consonants> and [U+0E47]
* (<Consonants> or [U+0E34-0E39]) and [U+0E48-0E4B]
* (<Consonants> or [U+0E34-0E39]) and U+0E4C
* <Consonants> and [U+0E4D-0E4E]
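
The category claim for the second character in each of these rules can be checked with Python's standard `unicodedata` module. This is a minimal sketch; the helper name `thai_mark_categories` is hypothetical, and the result depends on the Unicode data shipped with the Python build in use. It only reports General_Category values; whether each category actually maps to a no-break rule in UAX #29 is a separate question that this check does not answer.

```python
import unicodedata


def thai_mark_categories() -> dict[int, str]:
    """Map U+0E46..U+0E4E to their Unicode General_Category."""
    return {cp: unicodedata.category(chr(cp))
            for cp in range(0x0E46, 0x0E4F)}


if __name__ == "__main__":
    for cp, cat in thai_mark_categories().items():
        # Print code point, category, and the official character name.
        print(f"U+{cp:04X} {cat} {unicodedata.name(chr(cp))}")
```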

[1] http://www.aa.tufs.ac.jp/en

[2] http://dev.w3.org/csswg/css3-text/#line-breaking

[3] http://unicode.org/charts/PDF/U0E00.pdf

[4] http://unicode.org/reports/tr29/


Regards,
Koji


Received on Thursday, 10 February 2011 06:01:37 GMT
