Han character collation and strokes (Fwd: Re: [cldr-dev] Re: Questions on Chinese collation, stroke)

Original thread: http://www.unicode.org/mail-arch/unicode-ml/y2012-m06/thread.html#113

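This appears to be a discussion about Han character collation and stroke
counts. It looks like the Unicode database (CLDR) has some problems in this
area; anyone interested is welcome to dig deeper.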
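
For anyone who wants to poke at the data directly, here is a minimal Python
sketch for reading kTotalStrokes lines in the form quoted in the thread
(code point, property name, value). It assumes whitespace-separated fields
and that a second stroke count, if present, follows the first after a space,
as Kat describes. The function names and the two-valued sample line are
made up for illustration; they are not part of Unihan or CLDR.

```python
def parse_total_strokes(lines):
    """Map each character to its list of kTotalStrokes values."""
    table = {}
    for line in lines:
        fields = line.split()
        # Skip blank lines, comments, and other Unihan properties.
        if len(fields) < 3 or fields[0].startswith("#") or fields[1] != "kTotalStrokes":
            continue
        char = chr(int(fields[0][2:], 16))  # "U+8303" -> the character itself
        table[char] = [int(v) for v in fields[2:]]
    return table


def multi_valued(table):
    """Characters that carry more than one stroke count."""
    return sorted(c for c, counts in table.items() if len(counts) > 1)


# The line exactly as quoted in the thread.
current = parse_total_strokes(["U+8303 kTotalStrokes 8"])
# Hypothetical: what the entry might look like with a second count of 9 added.
proposed = parse_total_strokes(["U+8303 kTotalStrokes 8 9"])

print(multi_valued(current))   # no two-valued entries in the quoted data
print(multi_valued(proposed))  # the hypothetical entry shows up here
```

Running multi_valued over the real Unihan_DictionaryLikeData.txt would be one
way to repeat Kat's spot check across the whole file rather than only the
8- and 9-stroke entries.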

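Also, I don't know whether string comparison in ECMAScript's new
internationalization API [1] uses this database; anyone interested is welcome
to research, translate, and discuss it. (I haven't yet spent enough time
sorting out the translation question with the Unicode Consortium; otherwise
the UTR/UTX documents would also be well worth translating.) When you have
time, come and carefully check the correctness of each algorithm.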
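
As one starting point for checking, here is a small Python sketch of why
stroke count alone does not give a total order, which is the point Stephan
makes in the thread. The stroke counts for 范 (8, per the Unihan line) and
姚 (9, per Matt) come from the thread; 一, 二 and 十 are added here only to
produce a tie. Breaking ties by code point is an arbitrary assumption for
illustration, not a claim about what CLDR does.

```python
# Stroke counts: 范 and 姚 as stated in the thread; the rest added to show a tie.
STROKES = {"一": 1, "二": 2, "十": 2, "范": 8, "姚": 9}


def stroke_key(ch):
    # Primary key: stroke count. Secondary key: code point (assumed tie-breaker).
    return (STROKES[ch], ord(ch))


def stroke_sorted(chars):
    """Sort characters by stroke count, then by code point."""
    return sorted(chars, key=stroke_key)


def ties(chars):
    """Group characters that the stroke count alone cannot order."""
    groups = {}
    for ch in chars:
        groups.setdefault(STROKES[ch], []).append(ch)
    return {n: sorted(g) for n, g in groups.items() if len(g) > 1}


chars = ["范", "姚", "十", "二", "一"]
print(stroke_sorted(chars))  # ['一', '二', '十', '范', '姚']
print(ties(chars))           # {2: ['二', '十']}
```

With 8 strokes 范 sorts before 姚; counted as 9 strokes, the two would tie and
the secondary key would decide, which is why the choice of tie-breaker has to
be specified somewhere.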

[1]
http://wiki.ecmascript.org/doku.php?id=globalization:specification_drafts

That's all.

Kenny

-------- Original Message --------
Subject:  Re: [cldr-dev] Re: Questions on Chinese collation, stroke
Date:  Mon, 25 Jun 2012 12:02:36 -0700
From:  Matt Ma <matt.ma.umail@gmail.com>
To:  Stephan Stiller <sstiller@stanford.edu>
CC:  Mark Davis ☕ <mark@macchiato.com>, Katsuhiko Momoi
<katmomoi@gmail.com>, "Claire Ho (賀靜蘭)" <claireho@google.com>,
"cldr-users@unicode.org" <cldr-users@unicode.org>, Unicode
<unicode@unicode.org>



Hi Stephan,

I agree that those orders require a great deal of work.

For the stroke order, the specification (现代汉语通用字笔顺规范) explicitly gives
the stroke order for 7,000 commonly used Simplified Chinese characters in
P.R. China. It also has a set of rules that aim to reduce ambiguity
in how strokes are counted and ordered. Perhaps the characters listed in
the spec could serve as a starting point.

Thanks,
Matt

On Fri, Jun 22, 2012 at 7:43 PM, Stephan Stiller <sstiller@stanford.edu> wrote:
> Dear Matt,
>
> I think those tasks would take quite a bit of work, because (1) the three
> orders you are mentioning are all mathematically underspecified and (2)
> they're partial orders even when considering only what you'd normally
> consider the respective target domains (certain subsets of CJKV).
>
> I'm sure many or most people reading this know this, but the question is
> which committee would get rid of the underspecification (also, according to
> what principles?), fine-tune the respective target domains, and such.
> (Perhaps the IICore people have done parts of the footwork already?)
>
> Stephan
>
>
> On 6/22/2012 5:05 PM, Matt Ma wrote:
>>
>> Entered ticket #4949 for Simplified Chinese, stroke order.
>>
>> Thanks,
>> Matt
>>
>> On Fri, Jun 22, 2012 at 12:55 PM, Mark Davis ☕ <mark@macchiato.com> wrote:
>>>
>>> There are no current plans to do that. If you want to present a case for
>>> adding additional collation sequences to CLDR, please start the process by
>>> filing a bug at http://unicode.org/cldr/trac/newticket
>>>
>>> ________________________________
>>> Mark
>>>
>>> — Il meglio è l’inimico del bene —
>>>
>>>
>>>
>>> On Fri, Jun 22, 2012 at 11:05 AM, Matt Ma <matt.ma.umail@gmail.com>
>>> wrote:
>>>>
>>>> Thanks all for the clarification. Are there any plans to provide the
>>>> following collations in CLDR?
>>>>
>>>>  1. Simplified Chinese, stroke order, based on 现代汉语通用字笔顺规范 (PRC-China
>>>> modern Chinese commonly used characters standard stroke orders,
>>>> mentioned in http://en.wikipedia.org/wiki/Stroke_order).
>>>>
>>>>  2. Simplified Chinese, radical order
>>>>
>>>>  3. Traditional Chinese, radical order
>>>>
>>>> Thanks,
>>>> Matt
>>>>
>>>> On Sat, Jun 9, 2012 at 1:02 AM, Katsuhiko Momoi <katmomoi@gmail.com>
>>>> wrote:
>>>>>
>>>>> Unihan-6.2.0d1/Unihan_DictionaryLikeData.txt is lacking the Traditional
>>>>> Chinese stroke count. Currently it only lists:
>>>>>
>>>>> U+8303 kTotalStrokes 8
>>>>>
>>>>> I filed a ticket for a review:
>>>>>
>>>>> http://unicode.org/cldr/trac/ticket/4898
>>>>>
>>>>> (I understand that we are supposed to list the Traditional stroke count
>>>>> after the Simplified one, delimited by a {sp}.)
>>>>>
>>>>> As a general observation, I glanced through a number of kTotalStrokes
>>>>> entries for strokes 8 and 9. I did not find a single entry that listed
>>>>> 2 stroke counts. This seems odd as there should be other stroke count
>>>>> differences between Simplified and Traditional Chinese. I suspect that
>>>>> this is an area needing more than one correction -- it would be better
>>>>> to do a systematic review.
>>>>>
>>>>> - Kat
>>>>>
>>>>> On Fri, Jun 8, 2012 at 3:44 PM, Mark Davis ☕ <mark@macchiato.com>
>>>>> wrote:
>>>>>>
>>>>>> It can supply the data for both, if they differ. That's done with two
>>>>>> fields.
>>>>>>
>>>>>> However, in this case there is only one value; if that's incorrect for
>>>>>> this character someone should file feedback.
>>>>>>
>>>>>> ________________________________
>>>>>> Mark
>>>>>>
>>>>>> — Il meglio è l’inimico del bene —
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Jun 8, 2012 at 2:41 PM, Claire Ho (賀靜蘭) <claireho@google.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> I checked tr38. According to the description of kTotalStrokes, it
>>>>>>> provides stroke count data for both Simplified Chinese and
>>>>>>> Traditional Chinese. So I don't have any concerns.
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Claire.
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jun 8, 2012 at 2:33 PM, Claire Ho (賀靜蘭) <claireho@google.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi Mark
>>>>>>>>
>>>>>>>>> There you find the line:
>>>>>>>>> U+8303 kTotalStrokes 8
>>>>>>>>
>>>>>>>> In Traditional Chinese, U+8303 has 9 strokes, as Matt mentioned in
>>>>>>>> the email.
>>>>>>>>
>>>>>>>> The radical "++" is counted as 4 strokes. I think several radicals
>>>>>>>> have the same issue: different stroke counts between Simplified
>>>>>>>> Chinese and Traditional Chinese.
>>>>>>>>
>>>>>>>> Claire.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jun 7, 2012 at 5:54 PM, Mark Davis ☕ <mark@macchiato.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> On Thu, Jun 7, 2012 at 4:28 PM, Matt Ma <matt.ma.umail@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I have two questions regarding the collation sequence defined in
>>>>>>>>>> zh.xml, CLDR 21.0
>>>>>>>>>>
>>>>>>>>>> 1. Why is U+8303 (范) counted as 9 strokes instead of 8 for
>>>>>>>>>> <collation type="stroke">? As a reference, U+59DA (姚) is counted
>>>>>>>>>> as 9 strokes but sorted before U+8303 (范).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> CLDR now gets the stroke collation data from the kTotalStrokes
>>>>>>>>> property. The values for that are in the file
>>>>>>>>> Unihan/Unihan_DictionaryLikeData.txt in the Unicode Character
>>>>>>>>> Database.
>>>>>>>>>
>>>>>>>>> There you find the line:
>>>>>>>>>
>>>>>>>>> U+8303 kTotalStrokes 8
>>>>>>>>>
>>>>>>>>> If that is in error, or if there is any other error in
>>>>>>>>> the kTotalStrokes data, then please report the correct value
>>>>>>>>> according to
>>>>>>>>> http://www.unicode.org/review/pri230/ so that it can be fixed.
>>>>>>>>>
>>>>>>>>> As a related matter, CLDR now gets the pinyin collation data from
>>>>>>>>> the kMandarin property. The values for that are in the file
>>>>>>>>> Unihan/Unihan_Readings.txt in the Unicode Character Database.
>>>>>>>>> So if any of those are in error, they should also be reported as
>>>>>>>>> per http://www.unicode.org/review/pri230/ .
>>>>>>>>>
>>>>>>>>> The beta data is
>>>>>>>>> in ftp://www.unicode.org/Public/6.2.0/ucd/. Currently
>>>>>>>>> in ftp://www.unicode.org/Public/6.2.0/ucd/Unihan-6.2.0d1.zip
>>>>>>>>> but as the beta proceeds, the d1 might change to d2,d3...
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2. Does the collation type, stroke, apply to both Simplified and
>>>>>>>>>> Traditional Chinese, as I do not see anything defined in
>>>>>>>>>> zh_Hant.xml
>>>>>>>>>> under "stroke"?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Let me look at that.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Matt
>>>>>>>>>>
>>>>>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Katsuhiko Momoi <katmomoi@gmail.com>
>>>>>
>>>>>
>>>
>>
>
>

Received on Tuesday, 26 June 2012 00:06:16 UTC