- From: Jungshik Shin <jshin@mailaps.org>
- Date: Thu, 9 Oct 2003 13:05:58 +0900 (KST)
- To: w3c-rdfcore-wg@w3.org, w3c-i18n-ig@w3.org
On Fri, 3 Oct 2003, Francois Yergeau wrote:

> pat hayes wrote:
> > If most, or even a substantial fraction, of the world is
> > 'evil' then it is our job to allow them to be
> > evil, and the priests can go to hell.
>
> NFC was designed especially so that it matches the huge majority of
> current and predictable usage. Almost any string you can type with most
> keyboard drivers or transcode from most legacy encodings is
> automatically NFC. You will not find a substantial fraction of people
> doing non-NFC as a matter of fact.

I agree with you on most points. I don't have a strong argument against
requiring NFC (in the Character Model), but I don't think NFC is such a
cure-all that non-NFC on the web can be regarded as 'evil' (OK, you
withdrew it) or as only for viruses, forgeries, and worms. From a certain
point of view, NFD is more in line with the spirit of Unicode than NFC
(in that we would only have NFD if there had been no legacy character set
standards with which Unicode has to keep round-trip compatibility, and
Unicode had been devised from scratch). Nonetheless, I believe NFC is, in
_most_ cases, a sound choice in the real world.

However, it has to be kept in mind that to some people (for some
purposes) Unicode's model of their native scripts, and the Unicode
normalization derived from that model, are unsatisfactory. To take the
Korean example I'm most familiar with: requiring NFC for Korean text
works well for 'modern' Korean text made up only of pre-composed
*complete* orthographically-allowed syllables. However, it gets
'inelegant' when it comes to representing old texts, because pre-composed
syllables in the U+AC00 block have to be mixed with Korean letters in the
U+1100 block. It'd be much more consistent and elegant to represent such
a text exclusively in Korean letters in the U+1100 block, as I did in
<http://i18nl10n.com/korean/hunmin.html>, 'violating' CHARMOD's
requirement that everything be in NFC [1].
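The mixing described above can be reproduced with a minimal Python sketch
using the standard unicodedata module. The sample strings are illustrative
assumptions (one modern syllable and one old-orthography syllable using
ARAEA, U+119E); they are not taken from the hunmin.html document.

```python
import unicodedata


def code_points(text: str) -> str:
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def to_nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)


def to_nfd(text: str) -> str:
    return unicodedata.normalize("NFD", text)


# Modern syllable written as letters: KIYEOK (U+1100) + A (U+1161).
modern_letters = "\u1100\u1161"
# Old-orthography syllable: HIEUH (U+1112) + ARAEA (U+119E).
# ARAEA is outside the vowel range used by Hangul syllable composition,
# so no pre-composed syllable exists for this pair.
old_letters = "\u1112\u119e"

text = modern_letters + old_letters

# NFC composes only the modern syllable, leaving a mix of the
# U+AC00 block and the U+1100 block.
print(code_points(to_nfc(text)))  # -> U+AC00 U+1112 U+119E

# NFD keeps the whole text in the U+1100 block.
print(code_points(to_nfd(text)))  # -> U+1100 U+1161 U+1112 U+119E
```

Note that this only shows NFC versus standard NFD; it does not implement
the fuller decomposition discussed in footnote [1].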
One of the reasons I did it that way is that it's easier to input the
text in the representation I used (I don't need any complex input method
at all for that text; a simple keymap is sufficient [2]). I could have
post-processed it into NFC, which may help some 'clients' of my document
but not all: other clients would have to reverse my post-processing. For
instance, a full-blown lexical analyzer for Korean text, such as a search
engine may need, may have to put everything, even modern Korean
syllables, into the representation I used.

> What you may find is a small but scary minority who will intentionally
> do non-NFC to take advantage of holes in software that doesn't pay
> attention to the issue. Worms, viruses, forgeries, that kind of thing.

Just as 'evil' was too strong a word, I'm afraid the above paragraph went
too far.

Jungshik

[1] The document is not in NFD, either. Nor is it in NFKD. Instead, it's
in what I think should be NFD <http://i18nl10n.com/korean/jamo.html> (the
browser support part has to be ignored because Mozilla now does a lot
better than documented there on both Windows and Unix). Sometime before
Unicode 3.0, even the compatibility decomposition of Korean letter (Jamo)
clusters into basic Korean letter sequences was removed from Unicode,
although that decomposition should have been promoted to the canonical
decomposition. Because Unicode normalization is _permanently frozen_, no
fix is possible, although there's a proposal to change it (made not to
the UTC but submitted to JTC1/SC22/WG20 for a reason unknown to me:
http://std.dkuug.dk/JTC1/SC22/WG20/docs/N954.PDF (full);
http://std.dkuug.dk/JTC1/SC22/WG20/docs/N953.PDF (summary)). I hoped a
proposed introduction of tailored normalization would partly solve this
problem, but last June the UTC decided that normalization tailoring would
not be part of the Unicode standard.

[2] I used the keyboard defined by <http://i18nl10n.com/korean/kor2v.xkb.txt>.
It's for XKB (the X11 keyboard extension), but it shouldn't be hard to
figure out the mapping (keysym 0x100xxxx is for U+xxxx, that is, the code
point plus 0x1000000). The Korean script by itself doesn't need any
complex input method (as shown by this keyboard mapping); the need for an
input method only arises because NFC (in Unicode) or the
precomposed-syllable representation (in legacy character sets) is used.
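
The keysym mapping can be sketched in a few lines of Python, assuming the
usual X11 convention that a Unicode keysym is the code point plus
0x1000000. The helper names and the sample keysyms below are hypothetical;
they are not read from kor2v.xkb.txt.

```python
# Offset X11 adds to a Unicode code point to form a Unicode keysym.
UNICODE_KEYSYM_OFFSET = 0x1000000


def keysym_to_char(keysym: int) -> str:
    """Convert an X11 Unicode keysym to the character it stands for."""
    code_point = keysym - UNICODE_KEYSYM_OFFSET
    if not 0x100 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode keysym: {keysym:#x}")
    return chr(code_point)


def char_to_keysym(ch: str) -> int:
    """Convert a single character to its X11 Unicode keysym."""
    return ord(ch) + UNICODE_KEYSYM_OFFSET


# Hypothetical keymap entries: one key per Korean letter, no input method.
for keysym in (0x1001100, 0x1001161, 0x100119E):
    ch = keysym_to_char(keysym)
    print(f"{keysym:#x} -> U+{ord(ch):04X}")
```

Each key simply emits one letter from the U+1100 block, which is why a
plain keymap is enough for this representation.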
Received on Thursday, 9 October 2003 00:07:53 UTC