- From: John Cowan <cowan@ccil.org>
- Date: Sun, 26 Mar 2006 21:15:28 -0500
- To: Misha Wolf <Misha.Wolf@reuters.com>
- Cc: www-international@w3.org
Misha Wolf scripsit:

> These typically (and specifically for tar and zip) do not include
> media type information or charset or other type parameter information.
> The information in the tar format, for example, carries time stamps,
> permissions, and file type (where "type" means plain file vs. directory
> vs. device, etc.).

ZIP format, however, has the ability to store extended attributes,
though this is normally only used on VMS and OS/2 systems. See
http://www.info-zip.org/pub/infozip/doc/appnote-iz-latest.zip
for the details.

> The holy grail of a single unified character set that will supposedly
> solve the problem sounds nice until one looks at the details.

I think this line, and the rant which follows, is a serious exaggeration
of the facts.

> "Unicode" is itself a "vast array" (ever-increasing in number) of
> character code sets. Saying "Unicode" doesn't tell me if that's
> pre-"Korean mess" (see RFC 2279) "Unicode" or post-"Korean mess"
> "Unicode".

The pre-Korean-mess versions (1.0 and 1.1) do in fact have their own
charsets: Unicode-1-1 and Unicode-1-1-UTF-8. However, no one has ever
come forward with actual text encoded in anger that contains pre-mess
Korean syllables, so it's almost entirely academic.

> Or whether that's the "Unicode" that has among its design principles
> a uniform code width of 16 bits and an encoding strictly of text
> (specifically excluding musical notation), or the "Unicode" that has a
> much wider code width and includes non-textual cruft such as (yes, you
> guessed it) musical notation. Or whether it's one of the "Unicode"s
> that has an attempt at encoding language information (versions 3.1
> and 3.2), or one of the "Unicode"s (earlier and later) that do not.
> And so on.

All this is about one thing and one thing only: what characters your
implementation can handle. If you have an old 8-bit or 16-bit
implementation trying to process text involving new characters, it
won't know what to make of them, but it won't be seriously confused.
New implementations can of course handle old text fine using their
more advanced models, because everything is backward compatible.

As for language tagging in plain text, it was demanded by an IETF WG,
was in effect born deprecated, and is formally deprecated as of 4.0
but of course remains present and will forever.

> as far as I know, it's not even possible to have multiple versions of
> Unicode and to transcode between them on the same machine.

It's not possible because (other than the Korean mess) it's not
necessary either.

> [S]uppose Jacob has received a text file in Korean and the issue of
> labeling the charset and language is solved. If it is labeled as
> "ISO-2022-KR", he can proceed to make sense of the file; conversely
> if it is labeled as "utf-7" he cannot because he lacks information
> to determine whether the result of the transformation to "Unicode"
> should be interpreted as groups of 16 bits or some other code width,
> as well as which code points represent various hangul characters.

The former point is not relevant: you can interpret it as 16-bit or
21-bit codes without any difference in effect. The latter point I have
already addressed.

-- 
John Cowan  cowan@ccil.org  www.ccil.org/~cowan  www.ap.org
The competent programmer is fully aware of the strictly limited size of
his own skull; therefore he approaches the programming task in full
humility, and among other things he avoids clever tricks like the plague.
        --Edsger Dijkstra
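
The remark about code width can be checked with a short Python sketch. It is not part of the original exchange: the sample string and the helper names `utf16_units` and `code_points` are assumptions chosen for illustration. The sketch round-trips a Hangul string through UTF-7, then views the decoded text both as 16-bit code units and as full code points. For Hangul syllables, which lie in the Basic Multilingual Plane, the two views should coincide.

```python
def utf16_units(s):
    """Return the text as a list of 16-bit code units (UTF-16, big-endian)."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def code_points(s):
    """Return the text as a list of full Unicode code points."""
    return [ord(ch) for ch in s]


# Hypothetical sample: two Hangul syllables, standing in for the Korean file.
text = "\ud55c\uae00"

# Label-and-transfer step: the file arrives as UTF-7 bytes.
utf7_bytes = text.encode("utf-7")

# The receiver decodes UTF-7 without needing to pick a code width first.
decoded = utf7_bytes.decode("utf-7")

units = utf16_units(decoded)
points = code_points(decoded)

print([hex(u) for u in units])
print([hex(p) for p in points])
```

Both lists identify the same characters, so the choice between a 16-bit and a wider internal representation is a matter for the implementation, not something the charset label has to convey.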
Received on Monday, 27 March 2006 02:15:34 UTC