RE: [css3-text] Thai line breaking rules

From: Koji Ishii <kojiishi@gluesoft.co.jp>
Date: Mon, 7 Feb 2011 04:06:07 -0500
To: "www-style@w3.org" <www-style@w3.org>
CC: "'WWW International' (www-international@w3.org)" <www-international@w3.org>
Message-ID: <A592E245B36A8949BDB0A302B375FB4E0AAF13A50A@MAILR001.mail.lan>
I removed the rules that are already covered by extended grapheme clusters as defined in UAX #29[1]. The remaining rules and the removed ones are listed below.

I'll ask Minegishi-san at ILCAA to review this, but any feedback is also appreciated.

* U+0E2F
* U+0E5A

* [U+0E31, U+0E3A] and <Consonants>
* U+0E3F THAI CURRENCY SYMBOL BAHT and digits
* Digits ([U+0E50-0E59] and [U+0E50-0E59])

Covered by extended grapheme clusters in UAX #29:
* <Consonants> and [U+0E30-0E3A]
* [U+0E40-0E44] and <Consonants>
* [U+0E24, U+0E26] and U+0E45
* Any and U+0E46 (category=Lm)
* <Consonants> and [U+0E47]
* (<Consonants> or [U+0E34-0E39]) and [U+0E48-0E4B]
* (<Consonants> or [U+0E34-0E39]) and U+0E4C
* <Consonants> and [U+0E4D-0E4E]

[1] http://unicode.org/reports/tr29/


-----Original Message-----
From: www-style-request@w3.org [mailto:www-style-request@w3.org] On Behalf Of Koji Ishii
Sent: Monday, February 07, 2011 12:47 PM
To: www-style@w3.org
Subject: [css3-text] Thai line breaking rules

I had a meeting with ILCAA, the Research Institute for Languages and Cultures of Asia and Africa[1], in Tokyo. Minegishi-san at ILCAA presented his idea for the issue currently mentioned in the CSS3 Text spec[2]:

> Additionally, some guidance should be provided on how to break or not 
> break Southeast Asian in the absence of a dictionary.

Here's his draft of simple line breaking rules for Thai script in the absence of a dictionary. Any corrections, and any opinions on whether to include this in the spec, would be appreciated.

Thai character groups are based on TIS 620-2533, as noted in the Unicode code chart[3].
  Consonants: U+0E01-0E2E

Line breaks are prohibited between:
* Any and U+0E2F
* <Consonants> and [U+0E30-0E3A]
* [U+0E31, U+0E3A, U+0E40-0E44] and <Consonants>
* U+0E3F THAI CURRENCY SYMBOL BAHT and digits
* [U+0E24, U+0E26] and U+0E45
* [U+0E50-0E59] and [U+0E50-0E59]
* Any and U+0E5A
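
As a rough illustration, the rules above could be expressed as a check on a pair of adjacent characters. This is only a minimal Python sketch and not part of Minegishi-san's draft. The name break_prohibited is hypothetical, "[U+0E31, U+0E3A]" and "[U+0E24, U+0E26]" are read as two individual code points each rather than as ranges, and "digits" is assumed to mean the Thai digits U+0E50-0E59.

```python
# Hypothetical sketch of the "no break between X and Y" rules listed above.
# Assumptions: comma-separated items are individual code points, and
# "digits" means the Thai digits U+0E50-0E59.

CONSONANTS = range(0x0E01, 0x0E2F)        # U+0E01-0E2E
FOLLOWING_SIGNS = range(0x0E30, 0x0E3B)   # U+0E30-0E3A
LEADING = {0x0E31, 0x0E3A, *range(0x0E40, 0x0E45)}  # U+0E31, U+0E3A, U+0E40-0E44
THAI_DIGITS = range(0x0E50, 0x0E5A)       # U+0E50-0E59


def break_prohibited(before: str, after: str) -> bool:
    """Return True if the rules forbid a line break between the two characters."""
    a, b = ord(before), ord(after)
    return (
        b == 0x0E2F                                      # Any and U+0E2F
        or (a in CONSONANTS and b in FOLLOWING_SIGNS)    # <Consonants> and [U+0E30-0E3A]
        or (a in LEADING and b in CONSONANTS)            # [U+0E31, U+0E3A, U+0E40-0E44] and <Consonants>
        or (a == 0x0E3F and b in THAI_DIGITS)            # BAHT and digits
        or (a in (0x0E24, 0x0E26) and b == 0x0E45)       # [U+0E24, U+0E26] and U+0E45
        or (a in THAI_DIGITS and b in THAI_DIGITS)       # digits and digits
        or b == 0x0E5A                                   # Any and U+0E5A
    )
```

For example, break_prohibited("\u0e01", "\u0e32") is true because a consonant is followed by U+0E32, while two adjacent consonants match none of the rules, so the check returns false for them.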

The following rules were also presented, but the second character in each is in Unicode category Lm or Mn, so I suspect that UAX #29 Unicode Text Segmentation[4] already covers them.
* Any and U+0E46
* <Consonants> and [U+0E47]
* (<Consonants> or [U+0E34-0E39]) and [U+0E48-0E4B]
* (<Consonants> or [U+0E34-0E39]) and U+0E4C
* <Consonants> and [U+0E4D-0E4E]
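
The category claim for the second character of each rule above can be checked against the Unicode Character Database. A minimal sketch using Python's standard unicodedata module follows. The names SECOND_OPERANDS and categories are hypothetical, and the result reflects whatever Unicode version the Python build ships with.

```python
import unicodedata

# Second operand of each rule above: U+0E46 and U+0E47-0E4E.
SECOND_OPERANDS = [0x0E46, *range(0x0E47, 0x0E4F)]


def categories(code_points):
    """Map each code point (as "U+XXXX") to its Unicode general category."""
    return {f"U+{cp:04X}": unicodedata.category(chr(cp)) for cp in code_points}


if __name__ == "__main__":
    for name, category in categories(SECOND_OPERANDS).items():
        print(name, category)
```

Note that the general category alone does not decide grapheme cluster boundaries. UAX #29 defines them through the Grapheme_Cluster_Break property, so this only checks the Lm/Mn observation, not the coverage itself.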

[1] http://www.aa.tufs.ac.jp/en
[2] http://dev.w3.org/csswg/css3-text/#line-breaking
[3] http://unicode.org/charts/PDF/U0E00.pdf
[4] http://unicode.org/reports/tr29/

Received on Monday, 7 February 2011 09:05:30 UTC
