- From: James Clark <jjc@jclark.com>
- Date: Mon, 16 Sep 1996 09:37:44 +0000
- To: tbray@textuality.com
- Cc: w3c-sgml-wg@w3.org
> Date: Fri, 13 Sep 1996 09:46:36 -0700
> From: Tim Bray <tbray@textuality.com>
> If I'm going to write a program for processing XML, I'm going to use the
> tools I have on the computer that sits in front of me. They can deal
> with UTF8 today. Thus, I'm going to write a pure UTF8 program, with callouts
> to converters for interchange with various other facilities.
>
> And at the front I'm going to have a little kludge along the lines of the
> following:
>
> FirstByte = getc(stream);
> if ((FirstByte & 0xfe) == 0xfe)
> {
>     SecondByte = getc(stream);
>     temp = fopen(tempfile_name(), "w");
>     ConvertUCS2StreamToUTF8(FirstByte, SecondByte, stream, temp);
>     fclose(stream);
>     fclose(temp);
>     stream = fopen(tempfile_name(), "r");
>     FirstByte = getc(stream);
> }
I wouldn't write anything like that.
If you're writing a program to handle UTF-8 data and it does any
non-trivial manipulation of the data (such as displaying it), the
first thing you typically do is combine the bytes representing each
character into a single integer (usually a wchar_t). The program then
processes this stream of wchar_t's just as a non-Unicode program would
process a stream of char's.
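A minimal sketch of that first step, assuming BMP-only input and a hypothetical helper name (this is not code from the message; it handles only the one- to three-byte UTF-8 forms and does no error recovery beyond reporting failure):

```c
#include <stddef.h>
#include <wchar.h>

/* Decode one UTF-8 sequence (1- to 3-byte BMP forms only) into a
   single wchar_t. Returns the number of bytes consumed, or -1 on a
   malformed or truncated sequence. Hypothetical illustration. */
static int utf8_decode_one(const unsigned char *s, size_t len, wchar_t *out)
{
    if (len == 0)
        return -1;
    if (s[0] < 0x80) {                       /* 0xxxxxxx */
        *out = s[0];
        return 1;
    }
    if ((s[0] & 0xe0) == 0xc0) {             /* 110xxxxx 10xxxxxx */
        if (len < 2 || (s[1] & 0xc0) != 0x80)
            return -1;
        *out = (wchar_t)(((s[0] & 0x1f) << 6) | (s[1] & 0x3f));
        return 2;
    }
    if ((s[0] & 0xf0) == 0xe0) {             /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (len < 3 || (s[1] & 0xc0) != 0x80 || (s[2] & 0xc0) != 0x80)
            return -1;
        *out = (wchar_t)(((s[0] & 0x0f) << 12) |
                         ((s[1] & 0x3f) << 6) | (s[2] & 0x3f));
        return 3;
    }
    return -1;                               /* 4-byte forms omitted here */
}
```

The variable-length bit tests and shifts above are the per-character work a UTF-8 program pays on every input character.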
The conversion from UTF-8 to wchar_t is quite complicated and
expensive. The conversion from UCS-2 to wchar_t is very simple. In
fact, it is often a no-op (if your wchar_t is 16 bits and bytes don't
need to be swapped). So what I would end up doing would be in effect
converting everything to UCS-2 or UCS-4. That's why I think
UCS-2/UTF-16 has just as much right as UTF-8 to be considered the
canonical way of encoding Unicode/ISO 10646.
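For contrast, the whole UCS-2 conversion can be sketched in a few lines (hypothetical helper name, big-endian input assumed; when the bytes are already in native order the loop is effectively a memcpy, i.e. the no-op case mentioned above):

```c
#include <stddef.h>
#include <wchar.h>

/* Convert nchars big-endian UCS-2 characters to wchar_t, assuming
   wchar_t can hold at least 16 bits. Each character is just a
   byte pair joined together. Hypothetical illustration. */
static void ucs2be_to_wchars(const unsigned char *in, size_t nchars,
                             wchar_t *out)
{
    for (size_t i = 0; i < nchars; i++)
        out[i] = (wchar_t)((in[2 * i] << 8) | in[2 * i + 1]);
}
```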
James
Received on Monday, 16 September 1996 04:43:26 UTC