- From: Markus Scherer <duerst@w3.org>
- Date: Fri, 12 Jul 2002 09:14:36 +0900
- To: www-i18n-comments@w3.org
This is a last call comment from Markus Scherer (markus.scherer@jtcsv.com) on
the Character Model for the World Wide Web 1.0
(http://www.w3.org/TR/2002/WD-charmod-20020430/).

Semi-structured version of the comment:

Submitted by: Markus Scherer (markus.scherer@jtcsv.com)
Submitted on behalf of (maybe empty):
Comment type: substantive
Chapter/section the comment applies to: 4.1.3 The choice of Normalization Form C
The comment will be visible to: public
Comment title: Suggest normalization to "FCC" not NFC
Comment:

Suggestion, from experience working on ICU collation, normalization, etc.:

For Unicode text processing, where canonically equivalent texts should be
treated in equivalent ways (e.g., compare/collate equal, string search,
display), it is most efficient to work with text that is canonically ordered.
This does not require NFD; it works with most but not all NFC texts.

I suggest considering a variation of NFC that in its composition step only
composes contiguously, i.e., only adjacent characters. UAX 15 NFC also composes
discontiguously, skipping intermediate combining marks, which makes some NFC
texts not canonically ordered.

I would like to call this "FCC", "Fast C contiguous" or "Form C contiguous",
similar to Mark Davis's "FCD" ("Fast C or D") for all texts that are
canonically ordered. Unlike FCD, FCC is unique. Its implementation is simple:
Take NFC code and disable discontiguous composition.

Since almost all practical NFC texts are also FCC, it has the same advantages
for the Web as NFC, and normalization from one to the other is fast (mostly a
quick check + no-op).
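To illustrate the distinction drawn above, here is a minimal sketch using Python's standard unicodedata module. The helper names is_fcd and compose_contiguous are hypothetical and are not ICU API; compose_contiguous is only an approximation of the "disable discontiguous composition" idea, under the assumption that a mark may compose only when it directly follows the starter (or the composite built so far).

```python
import unicodedata


def is_fcd(text):
    """Return True if text passes an FCD-style check.

    For each character, take its full canonical decomposition and compare
    the combining class of the previous decomposition's last character with
    that of this decomposition's first character.
    """
    prev_trail = 0
    for ch in text:
        decomp = unicodedata.normalize("NFD", ch)
        lead = unicodedata.combining(decomp[0])
        if lead != 0 and prev_trail > lead:
            return False
        prev_trail = unicodedata.combining(decomp[-1])
    return True


def compose_contiguous(text):
    """Sketch of contiguous-only composition (assumed reading of "FCC").

    Decompose to NFD, then compose a character only with the immediately
    preceding output character, and only if that character is a starter.
    """
    out = []
    for ch in unicodedata.normalize("NFD", text):
        if out and unicodedata.combining(out[-1]) == 0:
            # A two-character string that NFC turns into one character
            # is a pair that composes.
            pair = unicodedata.normalize("NFC", out[-1] + ch)
            if len(pair) == 1:
                out[-1] = pair
                continue
        out.append(ch)
    return "".join(out)


# a + COMBINING CEDILLA (ccc 202) + COMBINING ACUTE ACCENT (ccc 230)
source = "a\u0327\u0301"
nfc = unicodedata.normalize("NFC", source)
fcc = compose_contiguous(source)

print([hex(ord(c)) for c in nfc], is_fcd(nfc))
print([hex(ord(c)) for c in fcc], is_fcd(fcc))
```

In this example NFC skips over the cedilla to compose a + acute into U+00E1, leaving the cedilla after a character whose decomposition ends in a higher combining class, so the NFC string fails the check. The contiguous-only variant leaves the sequence decomposed and canonically ordered. For a sequence such as a + U+0323 + U+0302, where each mark composes in turn, both approaches give the same result.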

Sincerely,
markus

References: See "FCD" in each of the following -
http://www.unicode.org/iuc/iuc21/a348.html =
http://oss.software.ibm.com/icu/docs/papers/normalization_iuc21.ppt
http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/collation/ICU_collation_design.htm
http://oss.software.ibm.com/icu/apiref/unorm_8h.html#_details

Structured version of the comment:

<lc-comment visibility="public" status="pending" decision="pending" impact="substantive">
<originator email="markus.scherer@jtcsv.com" represents="-" >Markus Scherer</originator>
<charmod-section href='http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-ChoiceNFC' >4.1.3</charmod-section>
<title>Suggest normalization to "FCC" not NFC</title>
<description>
<comment>
<dated-link date="2002-07-11" >Suggest normalization to "FCC" not NFC</dated-link>
<para>Suggestion, from experience working on ICU collation, normalization, etc.:

For Unicode text processing, where canonically equivalent texts should be
treated in equivalent ways (e.g., compare/collate equal, string search,
display), it is most efficient to work with text that is canonically ordered.
This does not require NFD; it works with most but not all NFC texts.

I suggest considering a variation of NFC that in its composition step only
composes contiguously, i.e., only adjacent characters. UAX 15 NFC also composes
discontiguously, skipping intermediate combining marks, which makes some NFC
texts not canonically ordered.

I would like to call this "FCC", "Fast C contiguous" or "Form C contiguous",
similar to Mark Davis's "FCD" ("Fast C or D") for all texts that are
canonically ordered. Unlike FCD, FCC is unique. Its implementation is simple:
Take NFC code and disable discontiguous composition.

Since almost all practical NFC texts are also FCC, it has the same advantages
for the Web as NFC, and normalization from one to the other is fast (mostly a
quick check + no-op).

Sincerely,
markus

References: See "FCD" in each of the following -
http://www.unicode.org/iuc/iuc21/a348.html =
http://oss.software.ibm.com/icu/docs/papers/normalization_iuc21.ppt
http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/collation/ICU_collation_design.htm
http://oss.software.ibm.com/icu/apiref/unorm_8h.html#_details
</para>
</comment>
</description>
</lc-comment>

Received on Thursday, 11 July 2002 20:21:10 UTC