RE: Fwd "a comment on NFC"

On Fri, 3 Oct 2003, Francois Yergeau wrote:
> pat hayes wrote:

> > If most, or
> > even a substantial fraction, of the world is
> > 'evil' then it is our job to allow them to be
> > evil, and the priests can go to hell.
>
> NFC was designed especially so that it matches the huge majority of current
> and predictable usage.  Almost any string you can type with most keyboard
> drivers or transcode from most legacy encodings is automatically NFC.  You
> will not find a substantial fraction of people doing non-NFC as a matter of
> fact.

  I agree with you on most points. I don't have a strong
argument against requiring NFC (in Character Model) but I don't think NFC
is a 'cure-all' such that non-NFC on the web can be regarded as 'evil'
(OK, you withdrew it) or as only for viruses, forgeries, and worms.  From a
certain point of view, NFD is more in line with the spirit of Unicode than
NFC (in that we would only have NFD if there had been no legacy character
set standard with which Unicode has to keep round-trip compatibility
and Unicode had been devised from scratch).  Nonetheless, I believe
NFC is, in _most_ cases, a sound choice in the real world. However, it
has to be kept in mind that to some people (for some purposes) Unicode's
model of their native scripts and Unicode normalization derived from the
model are unsatisfactory. To take the Korean example, which I'm most
familiar with, requiring NFC for Korean text works well for 'modern'
Korean text made up only of pre-composed *complete* orthographically
allowed syllables. However, it gets 'inelegant' when it comes to
representing old texts, because pre-composed syllables in the U+AC00
block have to be mixed with Korean letters in the U+1100 block. It'd be
much more consistent and elegant to represent such a text exclusively in
Korean letters in the U+1100 block, as I did in
<http://i18nl10n.com/korean/hunmin.html>, 'violating' CHARMOD's
requirement that everything be in NFC [1]. One of the reasons I did it
that way is that it's easier to input that text in the representation I
used (I don't need any complex input method at all for that text; a
simple keymap is sufficient [2]). I could have post-processed it into
NFC, which might help some 'clients' of my document but not all (other
clients would have to reverse my post-processing. For instance, a
full-blown lexical analyzer for Korean text, as may be needed by search
engines, may need to put everything - even modern Korean syllables -
in the representation I used).
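
A minimal sketch of the mixing described above, assuming Python's
standard unicodedata module (the sample syllables and the helper
codepoints() are illustrative choices, not taken from the hunmin.html
page). A modern syllable typed as letters composes into the U+AC00 block
under NFC, while a syllable using the old vowel ARAEA (U+119E) has no
pre-composed form and stays as letters, so an NFC text ends up with both:

```python
import unicodedata


def codepoints(text):
    """Return the code points of text as 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Both syllables are spelled with conjoining letters from the U+1100 block.
modern_jamo = "\u1112\u1161\u11ab"  # HIEUH + A + final NIEUN (modern)
old_jamo = "\u1112\u119e\u11ab"     # HIEUH + ARAEA + final NIEUN (old)

nfc_modern = unicodedata.normalize("NFC", modern_jamo)
nfc_old = unicodedata.normalize("NFC", old_jamo)
nfc_mixed = unicodedata.normalize("NFC", modern_jamo + old_jamo)

print(codepoints(nfc_modern))  # ['U+D55C']
print(codepoints(nfc_old))     # ['U+1112', 'U+119E', 'U+11AB']
print(codepoints(nfc_mixed))   # ['U+D55C', 'U+1112', 'U+119E', 'U+11AB']

# NFD takes the mixed text back to letters only.
nfd_mixed = unicodedata.normalize("NFD", nfc_mixed)
print(nfd_mixed == modern_jamo + old_jamo)  # True
```

The last check is the 'reverse post-processing' a client would have to
do to get a uniform letter-only representation back from an NFC text.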

> What you may find is a small but scary minority who will intentionally do
> non-NFC to take advantage of holes in software that doesn't pay attention to
> the issue.  Worms, viruses, forgeries, that kind of thing.

  Just like 'evil' was too strong a word, I'm afraid the above paragraph
went too far.

  Jungshik


[1] The document is not in NFD, either. Nor is it in NFKD. It's in what
I think NFD should be instead; see <http://i18nl10n.com/korean/jamo.html>
(the browser support part has to be ignored because Mozilla now does a
lot better than documented there on both Windows and Unix).  Sometime
before Unicode 3.0, even the compatibility decomposition of Korean
letter (Jamo) clusters into basic Korean letter sequences was removed
from Unicode, although that decomposition should have been promoted
to a canonical decomposition. Because Unicode normalization is
_permanently frozen_, no fix is possible, although there's a proposal to
change it (made not to the UTC but submitted to JTC1/SC22/WG20 for a
reason unknown to me: <http://std.dkuug.dk/JTC1/SC22/WG20/docs/N954.PDF>
(full); <http://std.dkuug.dk/JTC1/SC22/WG20/docs/N953.PDF> (summary)).
I had hoped that a proposed introduction of tailored normalization would
partly solve this problem, but last June the UTC decided that
normalization tailoring would not be part of the Unicode standard.
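
A small sketch of what the missing decomposition means in practice,
again assuming Python's unicodedata module; the choice of SSANGKIYEOK
(U+1101) as the sample cluster is only illustrative. The cluster letter
and the sequence of two basic KIYEOK letters (U+1100 U+1100) are left
distinct by every normalization form:

```python
import unicodedata

cluster = "\u1101"         # HANGUL CHOSEONG SSANGKIYEOK (a cluster letter)
sequence = "\u1100\u1100"  # two basic HANGUL CHOSEONG KIYEOK letters

# The cluster carries no decomposition mapping, canonical or compatibility.
has_decomposition = unicodedata.decomposition(cluster) != ""
print(has_decomposition)  # False

# So no normalization form maps the cluster to the sequence or vice versa.
unchanged = all(
    unicodedata.normalize(form, cluster) == cluster
    and unicodedata.normalize(form, sequence) == sequence
    for form in ("NFC", "NFD", "NFKC", "NFKD")
)
print(unchanged)  # True
```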

[2] I used the keyboard defined by
<http://i18nl10n.com/korean/kor2v.xkb.txt>. It's for XKB (X11 keyboard
extension), but it shouldn't be hard to figure out the mapping
(0x100xxxx is for U+xxxx). Korean script by itself doesn't
need any complex input method (as shown by this keyboard mapping), and
the need for an input method only arises because the NFC (in Unicode) /
precomposed-syllable (in legacy character sets) representation is used.
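
To read the keysyms in such a keymap, here is a minimal sketch in
Python. It assumes the usual X11 convention that a Unicode keysym is the
code point plus 0x1000000 (reserved range 0x1000100 to 0x110FFFF); the
function names are hypothetical and the sample value is not copied from
kor2v.xkb.txt:

```python
UNICODE_KEYSYM_BASE = 0x1000000  # assumed X11 offset for Unicode keysyms


def keysym_to_codepoint(keysym):
    """Map an X11 Unicode keysym to its Unicode code point."""
    if not 0x1000100 <= keysym <= 0x110FFFF:
        raise ValueError(f"not a Unicode keysym: {keysym:#x}")
    return keysym - UNICODE_KEYSYM_BASE


def codepoint_to_keysym(codepoint):
    """Map a Unicode code point (U+0100 and above) to an X11 keysym."""
    if not 0x100 <= codepoint <= 0x10FFFF:
        raise ValueError(f"code point out of range: {codepoint:#x}")
    return codepoint + UNICODE_KEYSYM_BASE


# A keysym of 0x1001100 would stand for U+1100 (HANGUL CHOSEONG KIYEOK).
print(f"U+{keysym_to_codepoint(0x1001100):04X}")  # U+1100
print(hex(codepoint_to_keysym(0x1161)))           # 0x1001161
```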

Received on Thursday, 9 October 2003 00:07:53 UTC