Re: XML character sets: the hard-minimalist manifesto from James Clark on 1996-09-16 (w3c-sgml-wg@w3.org from September 1996)

From: James Clark <jjc@jclark.com>
Date: Mon, 16 Sep 1996 09:37:44 +0000
To: tbray@textuality.com
Cc: w3c-sgml-wg@w3.org
Message-Id: <9609160837.AA28626@jclark.com>

> Date: Fri, 13 Sep 1996 09:46:36 -0700
> From: Tim Bray <tbray@textuality.com>

> If I'm going to write a program for processing XML, I'm going to use the
> tools I have on the computer that sits in front of me.  They can deal
> with UTF8 today.  Thus, I'm going to write a pure UTF8 program, with callouts
> to converters for interchange with various other facilities.
> 
> And at the front I'm going to have a little kludge along the lines of the
> following:
> 
>  FirstByte = getc(stream);
>  if (FirstByte & 0xfe == 0xfe) 
>  {
>    SecondByte = getc(stream); 
>    temp = fopen(tempfile_name(), "w");
>    ConvertUCS2StreamToUTF8(FirstByte, SecondByte, stream, temp);
>    fclose(stream);
>    stream = fopen(tempfile_name(), "r");
>    FirstByte = getc(stream);
>  }

I wouldn't write anything like that.

If you're writing a program to handle UTF-8 data and it does any
non-trivial manipulation of the data (such as displaying it), the
first thing you typically do is combine the bytes representing each
character into a single integer (usually a wchar_t).  The program then
processes this stream of wchar_t's just as a non-Unicode program would
process a stream of char's.

The conversion from UTF-8 to wchar_t is quite complicated and
expensive.  The conversion from UCS-2 to wchar_t is very simple.  In
fact, it is often a noop (if your wchar_t is 16 bits and bytes don't
need to be swapped).  So what I would end up doing would be in effect
converting everything to UCS-2 or UCS-4.  That's why I think
UCS-2/UTF-16 has just as much right as UTF-8 to be considered the
canonical way of encoding Unicode/ISO 10646.

James

Received on Monday, 16 September 1996 04:43:26 UTC