- From: Markus Scherer <duerst@w3.org>
- Date: Fri, 12 Jul 2002 09:14:36 +0900
- To: www-i18n-comments@w3.org
This is a last call comment from Markus Scherer (markus.scherer@jtcsv.com) on
the Character Model for the World Wide Web 1.0
(http://www.w3.org/TR/2002/WD-charmod-20020430/).
Semi-structured version of the comment:
Submitted by: Markus Scherer (markus.scherer@jtcsv.com)
Submitted on behalf of (maybe empty):
Comment type: substantive
Chapter/section the comment applies to: 4.1.3 The choice of Normalization Form C
The comment will be visible to: public
Comment title: Suggest normalization to "FCC" not NFC
Comment:
Suggestion, from experience working on ICU collation, normalization, etc.:
For Unicode text processing, where canonically equivalent texts should be
treated in equivalent ways (e.g., compare/collate equal, string search,
display), it is most efficient to work with text that is canonically
ordered. This does not require NFD; it works with most but not all NFC texts.
I suggest considering a variation of NFC that in its composition step
only composes contiguously, i.e., only adjacent characters.
UAX 15 NFC also composes discontiguously, skipping intermediate combining
marks, which makes some NFC texts not canonically ordered.
I would like to call this "FCC", "Fast C contiguous" or "Form C
contiguous", similar to Mark Davis's "FCD" ("Fast C or D") for all texts
that are canonically ordered.
Unlike FCD, FCC is unique (each text has exactly one FCC form). Its
implementation is simple: take the NFC code and disable discontiguous
composition.
Since almost all practical NFC texts are also FCC, it has the same
advantages for the Web as NFC, and normalization from one to the other is
fast (mostly a quick check + no-op).
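To illustrate the discontiguous case described above, here is a minimal sketch in Python. It is not ICU code: it assumes only the standard-library unicodedata module, and the helper name is_fcd and the sample string are hypothetical choices made for this example. The check follows the usual description of FCD (the trailing combining class of each character's canonical decomposition must not exceed the leading combining class of the next character's decomposition, unless that leading class is 0).

```python
import unicodedata


def is_fcd(text: str) -> bool:
    """Return True if text is canonically ordered in the FCD sense.

    Hypothetical helper for illustration only, not an ICU API.
    """
    prev_trail = 0
    for ch in text:
        # Canonical decomposition of this single character.
        decomp = unicodedata.normalize("NFD", ch)
        lead = unicodedata.combining(decomp[0])
        trail = unicodedata.combining(decomp[-1])
        # A non-starter must not follow a higher combining class.
        if lead != 0 and prev_trail > lead:
            return False
        prev_trail = trail
    return True


# a + COMBINING CEDILLA (ccc 202) + COMBINING ACUTE ACCENT (ccc 230).
# There is no precomposed "a with cedilla", but "a with acute" exists,
# so NFC composes the acute with the base across the cedilla.
sample = "a\u0327\u0301"
nfc = unicodedata.normalize("NFC", sample)

print([f"U+{ord(c):04X}" for c in nfc])  # ['U+00E1', 'U+0327']
print(is_fcd(sample))                    # True
print(is_fcd(nfc))                       # False
```

In this sketch the input is already canonically ordered and contains no adjacent composable pair, so a contiguous-only composition step would presumably leave it unchanged, while NFC produces U+00E1 U+0327, which is no longer canonically ordered.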
Sincerely,
markus
References:
See "FCD" in each of the following -
http://www.unicode.org/iuc/iuc21/a348.html
= http://oss.software.ibm.com/icu/docs/papers/normalization_iuc21.ppt
http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/collation/ICU_collation_design.htm
http://oss.software.ibm.com/icu/apiref/unorm_8h.html#_details
Structured version of the comment:
<lc-comment
visibility="public" status="pending"
decision="pending" impact="substantive">
<originator email="markus.scherer@jtcsv.com" represents="-"
>Markus Scherer</originator>
<charmod-section
href='http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-ChoiceNFC'
>4.1.3</charmod-section>
<title>Suggest normalization to "FCC" not NFC</title>
<description>
<comment>
<dated-link date="2002-07-11"
>Suggest normalization to "FCC" not NFC</dated-link>
<para>Suggestion, from experience working on ICU collation,
normalization, etc.:
For Unicode text processing, where canonically equivalent texts should be
treated in equivalent ways (e.g., compare/collate equal, string search,
display), it is most efficient to work with text that is canonically
ordered. This does not require NFD; it works with most but not all NFC texts.
I suggest considering a variation of NFC that in its composition step
only composes contiguously, i.e., only adjacent characters.
UAX 15 NFC also composes discontiguously, skipping intermediate combining
marks, which makes some NFC texts not canonically ordered.
I would like to call this "FCC", "Fast C contiguous" or "Form C
contiguous", similar to Mark Davis's "FCD" ("Fast C or D") for all texts
that are canonically ordered.
Unlike FCD, FCC is unique (each text has exactly one FCC form). Its
implementation is simple: take the NFC code and disable discontiguous
composition.
Since almost all practical NFC texts are also FCC, it has the same
advantages for the Web as NFC, and normalization from one to the other is
fast (mostly a quick check + no-op).
Sincerely,
markus
References:
See "FCD" in each of the following -
http://www.unicode.org/iuc/iuc21/a348.html
= http://oss.software.ibm.com/icu/docs/papers/normalization_iuc21.ppt
http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/collation/ICU_collation_design.htm
http://oss.software.ibm.com/icu/apiref/unorm_8h.html#_details
</para>
</comment>
</description>
</lc-comment>
Received on Thursday, 11 July 2002 20:21:10 UTC