- From: John C Klensin <KLENSIN@INFOODS.UNU.EDU>
- Date: Mon, 25 Oct 1993 05:25:13 -0400 (EDT)
- To: dank@BLACKS.JPL.NASA.GOV
- Cc: mohta@NECOM830.CC.TITECH.AC.JP, ietf-wnils@UCDAVIS.EDU, ietf-charsets@INNOSOFT.COM
> It seems to me that English and Greek characters need separate code points
> because their visual appearance is significantly different, not because
> they are from different languages.

Actually, Dan, a lot of other issues aside, you have hit on one of the critical issues here. Ohta-san has responded on this, but let me try a bit of a generalization. There are two issues that might usefully be thought of as separate:

(1) "Visual appearance is significantly different" is largely in the eye of the beholder. Is the Latin lower-case "a" the same as, or "significantly different" from, Greek lower-case alpha? Be careful about the answer, because it may differ from font to font, and typography is supposed not to be an issue here.

(2) To the degree that there are *any* letter-symbols we can agree are not "significantly different" in the Greek and Latin character sets (let's stick with alpha and look at its upper-case form, as Ohta-san did), one can then choose--starting from a traditionally ASCII-based world--between "ASCII characters with a Greek supplement" and "separate contiguous code points for the basic Latin and Greek characters". The former yields a smaller number of total codes because, e.g., Greek upper-case alpha does not get a code point separate from Latin upper-case A. The latter preserves some collating integrity and some useful relationships between, e.g., the upper-case and lower-case character sets, and maybe has some cultural merit (which moves dangerously close to "because they are different languages"). But the latter yields much larger total character sets, because similar symbols are assigned separate code points under some set of rules. The "ASCII with supplemental Greek characters" approach is known in the character set community as "unification".
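[Editorial aside, not part of the original message: the choice described above can be seen in how 10646/Unicode eventually turned out. Greek was *not* unified with Latin; visually similar letters kept separate code points. A minimal sketch in modern Python:]

```python
# Latin capital A and Greek capital Alpha look alike in many fonts,
# but 10646/Unicode assigns them distinct code points (no unification).
latin_a = "A"       # U+0041 LATIN CAPITAL LETTER A
greek_alpha = "\u0391"  # U+0391 GREEK CAPITAL LETTER ALPHA

print(hex(ord(latin_a)))      # 0x41
print(hex(ord(greek_alpha)))  # 0x391

# Separate code points despite similar glyphs:
assert latin_a != greek_alpha
```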
One of the several objections to IS 10646 and UNICODE in the Asian character set community is that North American and European-dominated committees and design teams were a lot more willing to "unify" characters deriving from Chinese ("Han") characters than they were to unify characters deriving from, e.g., Greek or North Semitic.

A few observations on your summary...

The ISO Universal Character Set (sic) standard is 10646, not 16046. There is no UCS-3, only UCS-2 (16-bit, equivalent to UNICODE in code points, but possibly with slightly different semantics and conformance rules) and UCS-4 (32-bit).

There is actually a community of objections to UTF-2. They are based on:

(1) For email purposes, and other situations with 7-bit constraints, UTF-2, by using an 8-bit form, requires double encoding. There are direct encodings of 16 or 32 bits to 7 bits that save time and maybe space.

(2) The variable-length nature of UTF-2 is optimal for ASCII and code points "low" in the 10646 sequence. It is pretty bad for the "upper end" of the BMP (UNICODE, UCS-2), and could get really pathological if the "high end" code positions of 10646 were used. So, to a certain extent, choosing it requires assuming that those higher code positions will never be used, or that the communities that will use them are never going to be important to the Internet. A straight 32-bit coding, possibly supplemented by conventional compression, does not have that problem.

    --john
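[Editorial aside, not part of the original message: UTF-2 is an early name for what became UTF-8, so both objections can be illustrated in modern Python. The string "αβγ" is an arbitrary sample chosen for this sketch.]

```python
import base64

# Objection (1): on a 7-bit channel, 8-bit UTF-8 output must be encoded
# a second time (here via base64), inflating the byte count; a direct
# 16-or-32-bit-to-7-bit encoding would need only one pass.
text = "\u03b1\u03b2\u03b3"        # "αβγ": three Greek letters
utf8 = text.encode("utf-8")        # 6 bytes (2 per letter)
doubled = base64.b64encode(utf8)   # 8 bytes of 7-bit-safe output
print(len(utf8), len(doubled))     # 6 8

# Objection (2): encoded length grows with code-point value.
for cp in (0x41, 0x3B1, 0xFFFD, 0x10400):
    print(hex(cp), len(chr(cp).encode("utf-8")), "byte(s)")
# 0x41 is 1 byte, 0x3b1 is 2, 0xfffd is 3, 0x10400 (beyond the BMP) is 4;
# the original UTF-2 design ran to even longer sequences for the
# "high end" 31-bit UCS-4 positions.
```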
Received on Monday, 25 October 1993 02:27:02 UTC