- From: Kang-Hao (Kenny) Lu <kennyluck@w3.org>
- Date: Tue, 26 Jun 2012 08:05:46 +0800
- To: W3C HTML5 中文興趣小組 <public-html-ig-zh@w3.org>
- CC: Matt Ma <matt.ma.umail@gmail.com>, "Claire Ho (賀靜蘭)" <claireho@google.com>
- Message-ID: <4FE8FCDA.3050405@w3.org>
Original thread:
http://www.unicode.org/mail-arch/unicode-ml/y2012-m06/thread.html#113

This appears to be a discussion about Han character collation and strokes. There seem to be some problems in the Unicode database (CLDR); anyone interested can take a closer look.

Also, I do not know whether string comparison in the new ECMAScript internationalization API [1] uses this database. Anyone interested is welcome to study, translate, and discuss it. (I have not yet spent enough time working out the translation question with the Unicode Consortium; otherwise the UTR/UTX documents would also be well worth translating.) When you have time, you could come and carefully check the correctness of each algorithm.

[1] http://wiki.ecmascript.org/doku.php?id=globalization:specification_drafts

That is all.
Kenny

-------- Original Message --------
Subject: Re: [cldr-dev] Re: Questions on Chinese collation, stroke
Date: Mon, 25 Jun 2012 12:02:36 -0700
From: Matt Ma <matt.ma.umail@gmail.com>
To: Stephan Stiller <sstiller@stanford.edu>
CC: Mark Davis ☕ <mark@macchiato.com>, Katsuhiko Momoi <katmomoi@gmail.com>, "Claire Ho (賀靜蘭)" <claireho@google.com>, "cldr-users@unicode.org" <cldr-users@unicode.org>, Unicode <unicode@unicode.org>

Hi Stephan,

I agree that those orders require a great deal of work. For the stroke
order, the specification (现代汉语通用字笔顺规范) explicitly shows the stroke
order of 7000 commonly used Simplified Chinese characters in P.R. China.
It also has a set of rules aiming to reduce the ambiguity in how strokes
are counted and ordered. Perhaps the characters listed in the spec can be
used as a starting point.

Thanks,
Matt

On Fri, Jun 22, 2012 at 7:43 PM, Stephan Stiller <sstiller@stanford.edu> wrote:
> Dear Matt,
>
> I think those tasks would take quite a bit of work, because (1) the three
> orders you are mentioning are all mathematically underspecified and (2)
> they're partial orders even when considering only what you'd normally
> consider the respective target domains (certain subsets of CJKV).
>
> I'm sure many or most people reading this know this, but the question is
> which committee would get rid of the underspecification (also, according to
> what principles?), fine-tune the respective target domains, and such.
> (Perhaps the IICore people have done parts of the footwork already?)
>
> Stephan
>
>
> On 6/22/2012 5:05 PM, Matt Ma wrote:
>>
>> Entered ticket #4949 for Simplified Chinese, stroke order.
>>
>> Thanks,
>> Matt
>>
>> On Fri, Jun 22, 2012 at 12:55 PM, Mark Davis ☕ <mark@macchiato.com> wrote:
>>>
>>> There are no current plans to do that. If you want to present a case for
>>> adding additional collation sequences to CLDR, please start the process by
>>> filing a bug at http://unicode.org/cldr/trac/newticket
>>>
>>> ________________________________
>>> Mark
>>>
>>> — Il meglio è l’inimico del bene —
>>>
>>>
>>>
>>> On Fri, Jun 22, 2012 at 11:05 AM, Matt Ma <matt.ma.umail@gmail.com> wrote:
>>>>
>>>> Thanks all for clarification. Are there any plans to provide the
>>>> following collations in CLDR?
>>>>
>>>> 1. Simplified Chinese, stroke order, based on 现代汉语通用字笔顺规范
>>>> (PRC-China modern Chinese commonly used characters standard stroke
>>>> orders, mentioned in http://en.wikipedia.org/wiki/Stroke_order).
>>>>
>>>> 2. Simplified Chinese, radical order
>>>>
>>>> 3. Traditional Chinese, radical order
>>>>
>>>> Thanks,
>>>> Matt
>>>>
>>>> On Sat, Jun 9, 2012 at 1:02 AM, Katsuhiko Momoi <katmomoi@gmail.com> wrote:
>>>>>
>>>>> Unihan-6.2.0d1/Unihan_DictionaryLikeData.txt is lacking the Traditional
>>>>> Chinese stroke count. Currently it only lists:
>>>>>
>>>>> U+8303 kTotalStrokes 8
>>>>>
>>>>> I filed a ticket for a review:
>>>>>
>>>>> http://unicode.org/cldr/trac/ticket/4898
>>>>>
>>>>> (I understand that we are supposed to list the Traditional stroke count
>>>>> after the Simplified one, delimited by a {sp}.)
>>>>>
>>>>> As a general observation, I glanced through a number of kTotalStrokes
>>>>> entries for strokes 8 and 9. I did not find a single entry that listed 2
>>>>> stroke counts. This seems odd as there should be other stroke count
>>>>> differences between Simplified and Traditional Chinese. I suspect that
>>>>> this is an area needing more than one correction -- it would be better
>>>>> to do a systematic review.
>>>>>
>>>>> - Kat
>>>>>
>>>>> On Fri, Jun 8, 2012 at 3:44 PM, Mark Davis ☕ <mark@macchiato.com> wrote:
>>>>>>
>>>>>> It can supply the data for both, if they differ. That's done with two
>>>>>> fields.
>>>>>>
>>>>>> However, in this case there is only one value; if that's incorrect for
>>>>>> this character someone should file feedback.
>>>>>>
>>>>>> ________________________________
>>>>>> Mark
>>>>>>
>>>>>> — Il meglio è l’inimico del bene —
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Jun 8, 2012 at 2:41 PM, Claire Ho (賀靜蘭) <claireho@google.com> wrote:
>>>>>>>
>>>>>>> I checked tr38: according to the description of kTotalStrokes, it
>>>>>>> provides stroke count data for both Simplified Chinese and
>>>>>>> Traditional Chinese. So I don't have a concern.
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Claire.
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jun 8, 2012 at 2:33 PM, Claire Ho (賀靜蘭) <claireho@google.com> wrote:
>>>>>>>>
>>>>>>>> Hi Mark
>>>>>>>>
>>>>>>>>> There you find the line:
>>>>>>>>> U+8303 kTotalStrokes 8
>>>>>>>>
>>>>>>>> In Traditional Chinese, U+8303 has 9 strokes, as Matt mentioned in
>>>>>>>> the email.
>>>>>>>>
>>>>>>>> The radical "++" is counted as 4 strokes. I think several radicals
>>>>>>>> have the same issue: different stroke counts between Simplified
>>>>>>>> Chinese and Traditional Chinese.
>>>>>>>>
>>>>>>>> Claire.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jun 7, 2012 at 5:54 PM, Mark Davis ☕ <mark@macchiato.com> wrote:
>>>>>>>>>
>>>>>>>>> On Thu, Jun 7, 2012 at 4:28 PM, Matt Ma <matt.ma.umail@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I have two questions regarding the collation sequence defined in
>>>>>>>>>> zh.xml, CLDR 21.0
>>>>>>>>>>
>>>>>>>>>> 1. Why is U+8303 (范) counted as 9 strokes instead of 8 for
>>>>>>>>>> <collation type="stroke">? As a reference, U+59DA (姚) is counted
>>>>>>>>>> as 9 strokes but sorted before U+8303 (范).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> CLDR now gets the stroke collation data from the kTotalStrokes
>>>>>>>>> property. The values for that are in the file
>>>>>>>>> Unihan/Unihan_DictionaryLikeData.txt in the Unicode Character
>>>>>>>>> Database.
>>>>>>>>>
>>>>>>>>> There you find the line:
>>>>>>>>>
>>>>>>>>> U+8303 kTotalStrokes 8
>>>>>>>>>
>>>>>>>>> If that is in error, or if there is any other error in the
>>>>>>>>> kTotalStrokes data, then please report the correct value according
>>>>>>>>> to http://www.unicode.org/review/pri230/ so that it can be fixed.
>>>>>>>>>
>>>>>>>>> As a related matter, CLDR now gets the pinyin collation data from
>>>>>>>>> the kMandarin property. The values for that are in the file
>>>>>>>>> Unihan/Unihan_Readings.txt in the Unicode Character Database. So if
>>>>>>>>> any of those are in error, they should also be reported as per
>>>>>>>>> http://www.unicode.org/review/pri230/ .
>>>>>>>>>
>>>>>>>>> The beta data is in ftp://www.unicode.org/Public/6.2.0/ucd/.
>>>>>>>>> Currently in
>>>>>>>>> ftp://www.unicode.org/Public/6.2.0/ucd/Unihan-6.2.0d1.zip
>>>>>>>>> but as the beta proceeds, the d1 might change to d2,d3...
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2. Does the collation type, stroke, apply to both Simplified and
>>>>>>>>>> Traditional Chinese, as I do not see anything defined in
>>>>>>>>>> zh_Hant.xml under "stroke"?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Let me look at that.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Matt
>>>>>>>>>>
>>>>>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Katsuhiko Momoi <katmomoi@gmail.com>
>>>>>
>>>>>
>>>
>>
>
>
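
For anyone who wants to experiment with the kTotalStrokes data discussed in the thread, here is a minimal Python sketch. It reads lines in the `U+XXXX kTotalStrokes N` shape quoted above and sorts characters by stroke count. The function names `parse_total_strokes` and `stroke_key` are hypothetical, and breaking ties by code point is only an assumption made for illustration; it is not how CLDR builds its stroke collation. The two-value line in `HYPOTHETICAL` is not real Unihan data: it only illustrates the "Simplified count, then Traditional count, separated by a space" layout that Katsuhiko describes.

```python
from typing import Dict, Iterable, Tuple


def parse_total_strokes(lines: Iterable[str]) -> Dict[str, Tuple[int, ...]]:
    """Map each character to the tuple of kTotalStrokes values on its line."""
    table: Dict[str, Tuple[int, ...]] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()  # fields are whitespace-separated (tabs in the real file)
        if len(fields) < 3 or fields[1] != "kTotalStrokes":
            continue  # some other Unihan property
        if not fields[0].startswith("U+"):
            continue  # not a code point field
        char = chr(int(fields[0][2:], 16))
        table[char] = tuple(int(value) for value in fields[2:])
    return table


def stroke_key(char: str, table: Dict[str, Tuple[int, ...]],
               traditional: bool = False) -> Tuple[int, int]:
    """Sort key: stroke count first, then code point (tie-break is an assumption)."""
    counts = table[char]
    # With two values, the first is taken as Simplified and the last as Traditional.
    count = counts[-1] if traditional else counts[0]
    return (count, ord(char))


# Values as quoted in the thread.
SAMPLE = [
    "U+8303\tkTotalStrokes\t8",
    "U+59DA\tkTotalStrokes\t9",
]

# Hypothetical two-value entry for U+8303, for illustration only.
HYPOTHETICAL = [
    "U+8303\tkTotalStrokes\t8 9",
    "U+59DA\tkTotalStrokes\t9",
]

if __name__ == "__main__":
    chars = ["\u59da", "\u8303"]
    simplified = parse_total_strokes(SAMPLE)
    print(sorted(chars, key=lambda c: stroke_key(c, simplified)))
    both = parse_total_strokes(HYPOTHETICAL)
    print(sorted(chars, key=lambda c: stroke_key(c, both, traditional=True)))
```

With the single value from the thread, U+8303 sorts before U+59DA; with the hypothetical Traditional count of 9, the two tie on stroke count and the order then depends entirely on the tie-break rule chosen.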
Received on Tuesday, 26 June 2012 00:06:16 UTC