Re: FW: Charset mandatory in unix/linux from John Cowan on 2006-03-27 (www-international@w3.org from January to March 2006)

From: John Cowan <cowan@ccil.org>
Date: Sun, 26 Mar 2006 21:15:28 -0500
To: Misha Wolf <Misha.Wolf@reuters.com>
Cc: www-international@w3.org
Message-ID: <20060327021527.GD15284@ccil.org>

Misha Wolf scripsit:

> These typically (and specifically for tar and zip) do not include
> media type information or charset or other type parameter information.
> The information in the tar format, for example, carries time stamps,
> permissions, and file type (where "type" means plain file vs. directory
> vs. device, etc.).

ZIP format, however, has the ability to store extended attributes,
though this is normally only used on VMS and OS/2 systems.  See
http://www.info-zip.org/pub/infozip/doc/appnote-iz-latest.zip for the
details.
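
As a rough illustration of the difference, here is a minimal Python sketch
using only the standard tarfile and zipfile modules. It reads back the basic
per-member metadata a tar header carries, then stores an arbitrary blob in a
ZIP entry's extra field. The header ID 0xCAFE and the "charset=utf-8" payload
are hypothetical, made up for this example; they are not a registered
extra-field type and not something Info-ZIP defines.

```python
import io
import struct
import tarfile
import zipfile

payload = b"hello\n"

# tar: write one member, then read back the basic header metadata.
tar_buf = io.BytesIO()
with tarfile.open(fileobj=tar_buf, mode="w") as tar:
    info = tarfile.TarInfo("note.txt")
    info.size = len(payload)
    info.mode = 0o644
    info.mtime = 1143425728  # arbitrary timestamp for the example
    tar.addfile(info, io.BytesIO(payload))

tar_buf.seek(0)
with tarfile.open(fileobj=tar_buf, mode="r") as tar:
    member = tar.getmember("note.txt")
    # Permissions, time stamp, file type: none of these names a charset.
    tar_meta = (member.mode, member.mtime, member.type == tarfile.REGTYPE)

# zip: an extra field is a sequence of (id, length, data) records.
HYPOTHETICAL_ID = 0xCAFE          # assumption: not a registered header ID
body = b"charset=utf-8"           # assumption: made-up payload
extra = struct.pack("<HH", HYPOTHETICAL_ID, len(body)) + body

zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, "w") as zf:
    zi = zipfile.ZipInfo("note.txt")
    zi.extra = extra
    zf.writestr(zi, payload)

zip_buf.seek(0)
with zipfile.ZipFile(zip_buf) as zf:
    stored_extra = zf.getinfo("note.txt").extra

print(tar_meta)
print(stored_extra)
```

Whether any unzip tool would act on such a field is a separate matter; the
sketch only shows that the container has room for it.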

> The holy grail of a single unified character set that will supposedly
> solve the problem sounds nice until one looks at the details.

I think this line, and the rant which follows, is a serious exaggeration
of the facts.

> "Unicode" is itself a "vast array" (ever-increasing in number) of
> character code sets.  Saying "Unicode" doesn't tell me if that's
> pre-"Korean mess" (see RFC 2279) "Unicode" or post-"Korean mess"
> "Unicode".

The pre-Korean-mess versions (1.0 and 1.1) do in fact have their own
charsets: Unicode-1-1 and Unicode-1-1-UTF-8.  However, no one has ever
come forward with actual text encoded in anger that contains pre-mess
Korean syllables, so it's almost entirely academic.

> Or whether that's the "Unicode" that has among its design principles
> a uniform code width of 16 bits and an encoding strictly of text
> (specifically excluding musical notation), or the "Unicode" that has a
> much wider code width and includes non-textual cruft such as (yes, you
> guessed it) musical notation.  Or whether it's one of the "Unicode"s
> that has an attempt at encoding language information (versions 3.1
> and 3.2), or one of the "Unicode"s (earlier and later) that do not.
> And so on.

All this is about one thing and one thing only:  what characters
your implementation can handle.  If you have an old 8-bit or 16-bit
implementation trying to process text involving new characters, it
won't know what to make of them, but it won't be seriously confused.
New implementations can of course handle old text fine using their more
advanced models, because everything is backward compatible.
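
A minimal sketch of that point, assuming Python 3 and taking U+1D11E (MUSICAL
SYMBOL G CLEF) as a stand-in for a "new" character: older text looks the same
to a 16-bit reader and a wider one, while the new character reaches a 16-bit
reader as two surrogate code units that it cannot interpret but can still
carry along intact. The helper name utf16_units is invented for this example.

```python
import struct


def utf16_units(s):
    """Return the 16-bit code units of s as a list of integers."""
    raw = s.encode("utf-16-be")
    return list(struct.unpack(f">{len(raw) // 2}H", raw))


old_text = "plain old text"
new_char = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the 16-bit range

# Old text: the 16-bit units and the code points are the same numbers.
old_same = utf16_units(old_text) == [ord(c) for c in old_text]

# New character: a 16-bit reader sees two units, both in the surrogate range.
units = utf16_units(new_char)
all_surrogates = all(0xD800 <= u <= 0xDFFF for u in units)

# Passing the units through unchanged loses nothing.
round_trip = struct.pack(f">{len(units)}H", *units).decode("utf-16-be")

print(old_same, [hex(u) for u in units], all_surrogates, round_trip == new_char)
```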

As for language tagging in plain text, it was demanded by an IETF WG,
was in effect born deprecated, and is formally deprecated as of 4.0, but
of course it remains present and always will.

> as far as I know, it's not even possible to have multiple versions of
> Unicode and to transcode between them on the same machine.

It's not possible because (the Korean mess aside) it's not necessary.

> [S]uppose Jacob has received a text file in Korean and the issue of
> labeling the charset and language is solved.  If it is labeled as
> "ISO-2022-KR", he can proceed to make sense of the file; conversely
> if it is labeled as "utf-7" he cannot because he lacks information
> to determine whether the result of the transformation to "Unicode"
> should be interpreted as groups of 16 bits or some other code width,
> as well as which code points represent various hangul characters.

The former point is not relevant: you can interpret it as 16-bit or
21-bit codes without any difference in effect.  The latter point I have
already addressed.
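
The width point can be checked with a small Python sketch, assuming Python 3
and its built-in "utf-7" codec. The sample string is an arbitrary pair of
hangul syllables chosen for this example. After decoding, reading the result
as 16-bit code units or as full-width code points gives the same numbers.

```python
import struct

text = "\ud55c\uae00"            # two hangul syllables (arbitrary sample)
wire = text.encode("utf-7")      # what would arrive labeled as utf-7
decoded = wire.decode("utf-7")

# 16-bit view: the UTF-16 code units of the decoded text.
raw16 = decoded.encode("utf-16-be")
units16 = list(struct.unpack(f">{len(raw16) // 2}H", raw16))

# Wide view: one integer per code point.
points = [ord(ch) for ch in decoded]

print([hex(u) for u in units16])
print([hex(p) for p in points])
```

Both lists are identical for text like this, so the choice of width makes no
difference to which hangul characters are meant.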

-- 
John Cowan  cowan@ccil.org  www.ccil.org/~cowan  www.ap.org
The competent programmer is fully aware of the strictly limited size of his own
skull; therefore he approaches the programming task in full humility, and among
other things he avoids clever tricks like the plague.  --Edsger Dijkstra
Received on Monday, 27 March 2006 02:15:34 UTC