Suggest normalization to "FCC" not NFC

This is a last call comment from Markus Scherer (markus.scherer@jtcsv.com) on
the Character Model for the World Wide Web 1.0
(http://www.w3.org/TR/2002/WD-charmod-20020430/).

Semi-structured version of the comment:

Submitted by: Markus Scherer (markus.scherer@jtcsv.com)
Submitted on behalf of (may be empty):
Comment type: substantive
Chapter/section the comment applies to: 4.1.3 The choice of Normalization Form C
The comment will be visible to: public
Comment title: Suggest normalization to "FCC" not NFC
Comment:
Suggestion, from experience working on ICU collation, normalization, etc.:

For Unicode text processing, where canonically equivalent texts should be 
treated in equivalent ways (e.g., compare/collate equal, string search, 
display), it is most efficient to work with text that is canonically 
ordered. This does not require NFD; it works with most but not all NFC texts.

I suggest considering a variation of NFC that, in its composition step, 
composes only contiguously, i.e., only adjacent characters.
UAX 15 NFC also composes discontiguously, skipping intermediate combining 
marks, which makes some NFC texts not canonically ordered.
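
A minimal sketch of such a case, assuming Python's standard `unicodedata` module. The helper name `is_canonically_ordered` is hypothetical, and the check is only an approximation of the FCD condition: it decomposes each character on its own, without reordering across characters, and compares the result with NFD.

```python
import unicodedata

def is_canonically_ordered(text):
    """Approximate FCD test (hypothetical helper, not an ICU API).

    Decompose every character separately, concatenate the pieces without
    reordering across character boundaries, and compare with the NFD of
    the whole string. They match only if no reordering was needed.
    """
    raw = "".join(unicodedata.normalize("NFD", ch) for ch in text)
    return raw == unicodedata.normalize("NFD", text)

# a + COMBINING GRAVE ACCENT BELOW (ccc 220) + COMBINING ACUTE ACCENT (ccc 230)
source = "a\u0316\u0301"

# NFC skips over U+0316 and composes a + U+0301 into U+00E1.
nfc = unicodedata.normalize("NFC", source)

print([f"U+{ord(c):04X}" for c in nfc])   # ['U+00E1', 'U+0316']
print(is_canonically_ordered(source))     # True
print(is_canonically_ordered(nfc))        # False
```

Here the NFC result hides the acute accent (ccc 230) inside U+00E1, in front of the grave accent below (ccc 220), so decomposing it character by character no longer yields canonical order.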

I would like to call this "FCC", "Fast C contiguous" or "Form C 
contiguous", similar to Mark Davis's "FCD" ("Fast C or D") for all texts 
that are canonically ordered.

Unlike FCD, FCC is unique. Its implementation is simple: Take NFC code and 
disable discontiguous composition.
Since almost all practical NFC texts are also FCC, it has the same 
advantages for the Web as NFC, and normalization from one to the other is 
fast (mostly a quick check + no-op).
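
The idea of composing only adjacent characters can be sketched as follows. This is an illustration under stated assumptions, not ICU's implementation: the function name `fcc` is hypothetical, it uses only Python's standard `unicodedata` module, and it tests whether two adjacent characters compose by checking whether NFC turns the pair into a single character.

```python
import unicodedata

def fcc(text):
    """Sketch of contiguous-only composition (hypothetical helper).

    Start from NFD, then try to compose each character only with the
    character immediately before it in the output. A combining mark that
    is separated from its starter by another mark is never composed.
    """
    out = []
    for ch in unicodedata.normalize("NFD", text):
        if out:
            # Assumption: a pair composes exactly when NFC maps it to one
            # character (this also respects the composition exclusions).
            pair = unicodedata.normalize("NFC", out[-1] + ch)
            if len(pair) == 1:
                out[-1] = pair
                continue
        out.append(ch)
    return "".join(out)

samples = ["caf\u00e9", "a\u0323\u0302", "a\u0316\u0301", "\u00e1\u0316"]
for s in samples:
    same = fcc(s) == unicodedata.normalize("NFC", s)
    print([f"U+{ord(c):04X}" for c in fcc(s)], "same as NFC:", same)
```

For the first two samples the output equals NFC; for the last two, NFC composes across the intervening mark while this sketch leaves `a`, U+0316, U+0301 uncomposed and in canonical order.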

Sincerely,
markus

References:
See "FCD" in each of the following -

http://www.unicode.org/iuc/iuc21/a348.html
= http://oss.software.ibm.com/icu/docs/papers/normalization_iuc21.ppt

http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/collation/ICU_collation_design.htm

http://oss.software.ibm.com/icu/apiref/unorm_8h.html#_details



Structured version of the comment:

<lc-comment
   visibility="public" status="pending"
   decision="pending" impact="substantive">
   <originator email="markus.scherer@jtcsv.com" represents="-"
       >Markus Scherer</originator>
   <charmod-section 
href='http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-ChoiceNFC'
     >4.1.3</charmod-section>
   <title>Suggest normalization to "FCC" not NFC</title>
   <description>
     <comment>
       <dated-link date="2002-07-11"
         >Suggest normalization to "FCC" not NFC</dated-link>
       <para>Suggestion, from experience working on ICU collation, 
normalization, etc.:

For Unicode text processing, where canonically equivalent texts should be 
treated in equivalent ways (e.g., compare/collate equal, string search, 
display), it is most efficient to work with text that is canonically 
ordered. This does not require NFD; it works with most but not all NFC texts.

I suggest considering a variation of NFC that, in its composition step, 
composes only contiguously, i.e., only adjacent characters.
UAX 15 NFC also composes discontiguously, skipping intermediate combining 
marks, which makes some NFC texts not canonically ordered.

I would like to call this "FCC", "Fast C contiguous" or "Form C 
contiguous", similar to Mark Davis's "FCD" ("Fast C or D") for all texts 
that are canonically ordered.

Unlike FCD, FCC is unique. Its implementation is simple: Take NFC code and 
disable discontiguous composition.
Since almost all practical NFC texts are also FCC, it has the same 
advantages for the Web as NFC, and normalization from one to the other is 
fast (mostly a quick check + no-op).

Sincerely,
markus

References:
See "FCD" in each of the following -

http://www.unicode.org/iuc/iuc21/a348.html
= http://oss.software.ibm.com/icu/docs/papers/normalization_iuc21.ppt

http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/collation/ICU_collation_design.htm

http://oss.software.ibm.com/icu/apiref/unorm_8h.html#_details
</para>
     </comment>
   </description>
</lc-comment>

Received on Thursday, 11 July 2002 20:21:10 UTC