- From: James Clark <jjc@jclark.com>
- Date: Mon, 16 Sep 1996 09:37:44 +0000
- To: tbray@textuality.com
- Cc: w3c-sgml-wg@w3.org
> Date: Fri, 13 Sep 1996 09:46:36 -0700
> From: Tim Bray <tbray@textuality.com>
>
> If I'm going to write a program for processing XML, I'm going to use the
> tools I have on the computer that sits in front of me. They can deal
> with UTF8 today. Thus, I'm going to write a pure UTF8 program, with callouts
> to converters for interchange with various other facilities.
>
> And at the front I'm going to have a little kludge along the lines of the
> following:
>
> FirstByte = getc(stream);
> if (FirstByte & 0xfe == 0xfe)
>   {
>     SecondByte = getc(stream);
>     temp = fopen(tempfile_name(), "w");
>     ConvertUCS2StreamToUTF8(FirstByte, SecondByte, stream, temp);
>     fclose(stream);
>     stream = fopen(tempfile_name(), "r");
>     FirstByte = getc(stream);
>   }

I wouldn't write anything like that. If you're writing a program to
handle UTF-8 data and it does any non-trivial manipulation of the data
(such as displaying it), the first thing you typically do is combine
the bytes representing each character into a single integer (usually a
wchar_t). The program then processes this stream of wchar_t's just as
a non-Unicode program would process a stream of char's.

The conversion from UTF-8 to wchar_t is quite complicated and expensive.
The conversion from UCS-2 to wchar_t is very simple. In fact, it is
often a noop (if your wchar_t is 16 bits and bytes don't need to be
swapped). So what I would end up doing would be in effect converting
everything to UCS-2 or UCS-4.

That's why I think UCS-2/UTF-16 has just as much right as UTF-8 to be
considered the canonical way of encoding Unicode/ISO 10646.

James
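
The difference in decoding effort described above can be illustrated with a rough sketch in C. This is not code from the message: the function names `utf8_decode` and `ucs2be_decode` are hypothetical, `uint32_t` stands in for `wchar_t`, and the UTF-8 decoder assumes the modern form of at most four bytes per character (it does not reject surrogates or values above 0x10FFFF).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence starting at s, with n bytes available.
   Stores the code point in *cp and returns the number of bytes consumed,
   or 0 if the input is truncated, malformed, or an overlong encoding. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    /* Smallest code point that legitimately needs each sequence length. */
    static const uint32_t min_for_len[5] = {0, 0, 0x80, 0x800, 0x10000};
    size_t len, i;
    uint32_t c;

    assert(s != NULL && cp != NULL);
    if (n == 0)
        return 0;

    /* The lead byte determines the sequence length and its payload bits. */
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        len = 2; c = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {
        len = 3; c = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {
        len = 4; c = s[0] & 0x07;
    } else {
        return 0;               /* stray continuation or invalid lead byte */
    }
    if (n < len)
        return 0;               /* truncated sequence */

    /* Each continuation byte must be 10xxxxxx and adds six payload bits. */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (c < min_for_len[len])
        return 0;               /* overlong encoding */

    *cp = c;
    return len;
}

/* Decode one big-endian UCS-2 code unit from the two bytes at s.
   On a big-endian machine with a 16-bit wchar_t this is effectively a copy. */
uint32_t ucs2be_decode(const unsigned char *s)
{
    assert(s != NULL);
    return ((uint32_t)s[0] << 8) | (uint32_t)s[1];
}
```

The UTF-8 path needs a branch on the lead byte, a loop over continuation bytes, and validity checks, while the UCS-2 path is a single shift and OR (or nothing at all when the byte order already matches).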
Received on Monday, 16 September 1996 04:43:26 UTC