- From: Borka Jerman-Blazic <jerman-blazic@ijs.si>
- Date: Fri, 05 Nov 1993 09:10:32 +0100
- To: ietf-charsets@INNOSOFT.COM
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The address of Mr. Masami Hasegawa from his business card is:

    ENET: JRDV04::MA_HASEGAWA
    LOC. CODE: JRD

The chair of the Japanese delegation to the Rennes meeting was Prof. Wada. His address is the following:

    wada@ccut.utyo.junet.jp

I will try to find out their newer e-mail addresses and mail them to you.

Forwarded messages:

+-+-+-+-+-+-+-+-+-+   C I N E T - L   N e w s l e t t e r   +-+-+-+-+-+-+-+-+-+

                    Issue No. 5, Sunday, October 31, 1993

+--------------------------------------------------------------------------+
| China's InterNET Technical Forum (CINET-L) is a non-public discussion    |
| list. CINET-L is technically sponsored by China News Digest and CINET-L  |
| newsletter is published by volunteers in CINET-EDITOR@CND.ORG. For more  |
| information regarding CINET-L, please see the end of this message.       |
+--------------------------------------------------------------------------+

Table of Contents                                                # of Lines
============================================================================
1. News Briefs (2 Items) ................................................ 21
2. Book Review: The New Nutshell Handbook on sendmail ................... 70
3. Move Over, ASCII! Unicode Is Here ................................... 310
4. Unicode, Wide Characters, and C ..................................... 344
============================================================================

- ----------------------------------------------------------------------------
1. News Briefs (2 Items) ................................................ 21
- ----------------------------------------------------------------------------

Contributed by: Hao Xin of Computer Network Center NCFC
Date: October 27, 1993

The local distribution list at CNC for the China InterNET Tech Forum (CINET-L) has grown to include 21 users on CASnet, up from zero a month ago. Most of the users on the local distribution list are graduate students and young staff members who do not yet have direct international email access. Other users on the local list do have international mail access, but choose to receive CINET-L via local distribution to save network traffic and money.

                              ___     ___     ___

Contributed by: Hao Xin of Computer Network Center NCFC
Date: October 27, 1993

More than two weeks after their arrival on Chinese shores, the long-awaited DEC computers for the NCFC project now get to see the light of day: they are being taken out of their crates today. The equipment includes a VAX 4000/100, a VAXstation 4000, a DECstation 5000/25, and an NIS 600 multi-protocol router. They will be used for the final link of the three campus networks (Beijing Univ., Tsinghua, and CAS) which comprise the NCFC project. It is hoped that the NCFC network will be functioning by the end of January 1994.

- ----------------------------------------------------------------------------
2. Book Review: The New Nutshell Handbook on sendmail ................... 70
- ----------------------------------------------------------------------------

Forwarded by: Liedong Zheng

Title: Sendmail
By Bryan Costales, with Eric Allman & Neil Rickert
1st Edition November 1993 (est.)
750 pages (est.), ISBN: 1-56592-056-2, $32.95 (est.)

Book Description:

This new Nutshell Handbook is far and away the most comprehensive book ever written on sendmail, a program that acts like a traffic cop in routing and delivering mail on UNIX-based networks.
Although sendmail is used on almost every UNIX system, it's one of the last great uncharted territories--and one of the most difficult utilities to learn--in UNIX system administration. This book provides a complete sendmail tutorial, plus extensive reference material on every aspect of the program. What's more, it's authoritative, having been co-authored by Eric Allman, the developer of sendmail, and Neil Rickert, one of the leading sendmail gurus on the Net.

The book covers both major versions of sendmail: the standard version available on most systems, and IDA sendmail, a version from Europe that uses a much more readable configuration file.

Part One of the book is a tutorial on understanding sendmail from the ground up. Starting from an empty file, it has the reader work through exercises, building a configuration file and testing the results. Part Two covers practical issues in sendmail administration, while Part Three is a comprehensive reference section.

Author Information:

Bryan Costales is System Manager at the International Computer Science Institute in Berkeley, California. He has been writing articles and books about computer software for over ten years. His most notable books are C from A to Z (Prentice Hall) and Unix Communications (Howard Sams). In his free time (chuckle, chuckle) he sails San Francisco Bay in his 26-foot sloop, goes camping with his Land Rover, and walks his dog Zypher. He is an avid movie viewer, reads tons of science fiction, and plays chess and volleyball.

Eric Allman is the Lead Programmer on the Mammoth Project at the University of California at Berkeley. This is his second incarnation at Berkeley; previously, he was the Chief Programmer on the INGRES database management project. In addition to his assigned tasks, he got involved with the early UNIX effort at Berkeley. His first experiences with UNIX were with 4th Edition, and he still has the manuals to prove it (and has been accused of being a pack rat because of it). Over the years, he wrote a number of utilities that appeared with various releases of BSD, including the -me macros, tset, trek, syslog, vacation, and of course sendmail. Eric spent the years between the two Berkeley incarnations at Britton Lee (later Sharebase) doing database user and application interfaces, and at the International Computer Science Institute, contributing to the Ring Array Processor project for neural-net-based speech recognition. He also co-authored the "C Advisor" column for Unix Review for several years. Eric has been accused of working incessantly, enjoys writing with fountain pens, and collects wines which he stashes in the cellar of the house that he shares with Kirk McKusick, his partner of 14-and-some-odd years. He is a member of the Board of Directors of USENIX Association, which is much more work than he had expected.

Neil Rickert earned his Ph.D. at Yale in Mathematics. He is currently a professor of computer science at Northern Illinois University. He likes to keep contact with the practical side of computing, and so spends part of his time in UNIX system administration. He has been involved with the IDA sendmail project, and is largely responsible for the current version of the IDA configuration.

- ----------------------------------------------------------------------------
3. Move Over, ASCII! Unicode Is Here ................................... 310
- ----------------------------------------------------------------------------
Forwarded by: Liu Jian
Source: PC Magazine, October 26, 1993
Written by: Petzold, Charles

A great concept deserves a great name, and that name is Unicode. Say it to yourself a few times. Get used to it. The prefix uni is from the Latin word for one, and the word code (as defined in sense 3a in The American Heritage Dictionary) is "A system of signals used to represent letters or numbers in transmitting messages." Say it again: Unicode. How hip and mellifluous the word is, particularly when compared with ASCII (pronounced ASS-key), EBCDIC (EB-see-dik), or even Baudot (baw-DOE).

These comparisons are quite valid, for the goal of Unicode is nothing less than to dislodge and replace what is perhaps the most dominant standard in personal computing--the American Standard Code for Information Interchange. Ambitious? Of course. But Unicode makes so much sense, it seems inevitable. Check out some of the companies that collaborated in The Unicode Consortium to bring Unicode about: IBM, Microsoft, Apple, Xerox, Sun, Digital, Novell, Adobe, NeXT, Lotus, and WordPerfect. With the release of Windows NT, Unicode has become not just a proposed standard, but a reality, and in the next couple of issues we'll take a look at that reality. Let's begin, however, with a historical perspective, and why Unicode is so important to the future of computing.

CODING LANGUAGE

Human beings differ from other species in their comparatively high level of communication and the development of spoken language. The need to record spoken language led to writing, which makes it possible to preserve and convey knowledge and experience. Computers and other digital systems work entirely with numbers, so to represent text in our computers, it is necessary to create an equivalence between numbers and characters.

Until the invention of the telegraph by Samuel Morse in the mid-1800s, long-distance communication required letters to be transported by person, horse, or train. The telegraph made long-distance communication nearly instantaneous by transmitting a series of electrical pulses through a wire. But what do electrical pulses have to do with language? The telegraph required that a code be devised correlating each letter in the alphabet with a particular series of short and long pulses (dots and dashes) that sounded like clicks on the receiving end.

Morse code was not the first instance of written language being represented by something other than drawn or printed glyphs. Braille came earlier and was inspired by a system for coding secret military messages. And Morse code was not a binary system: The long and short pulses had to be separated by different delays between letters and words. Binary systems for representing written language (letters represented by a fixed-length series of 0s and 1s) came later.

One of the early binary systems used in telexes was called Baudot (named after a French engineer who died in 1903). Baudot was a 5-bit code. Normally, the use of 5 bits is limited to representing 32 characters, which is sufficient for the 26 characters of the alphabet (not differentiated by case) but not much else. However, one Baudot code represented a "shift" that made subsequent codes map to numbers and punctuation symbols. This feature extended Baudot to be nearly as extensive as a 6-bit code.
The American Standard Code for Information Interchange (ASCII) was crowned as a standard by the American National Standards Institute (ANSI) some 20 years ago. As defined by ANSI, ASCII is a 7-bit code, of which the first 32 codes and the last code are control characters (such as a carriage return, line-feed, and tab). That leaves room for 26 lowercase letters, 26 uppercase letters, 10 numbers, and 33 symbols and punctuation marks.

ASCII has become the dominant standard for all computers, except for mainframes made by a not-insignificant company called IBM. The IBM heavy iron machines use an 8-bit system called the Extended Binary Coded Decimal Interchange Code (EBCDIC). Using 8 bits should allow for twice as many codes as ASCII, but much of the EBCDIC code space is not assigned. One peculiarity is that EBCDIC doesn't represent the alphabet with consecutive codes--the capital letters A through I are hexadecimal codes 0xC1 through 0xC9; J through R are 0xD1 through 0xD9; and S through Z are 0xE2 through 0xE9. This only makes sense when you see the patterns on punch cards!

THE WORLD BEYOND OUR BORDERS

With the exception of IBM mainframes, ASCII is just about the only standard common among computers. No other standard is as prevalent or as ingrained in our keyboards, video displays, system hardware, printers, font files, operating systems, electronic mail, and information services. But there's a big problem with ASCII, and that problem is indicated by the first word of the acronym. ASCII is truly an American standard, but there's a whole wide world outside our borders where ASCII is simply inadequate. It isn't even good enough for countries that share our language, for where is the British pound sign in ASCII?

Among written languages that use the Latin (or Roman) alphabet, English is unusual in that almost all of our words use the bare letters without accent marks. Go across the Atlantic and take a look at the French, German, or Swedish languages in print to see a variety of diacritics that originally aided in adapting the Latin alphabet to the differences in spoken sounds among these languages. Journey farther east or south, and you'll encounter written languages that don't use the Latin alphabet at all, such as Greek, Hebrew, Arabic, and Russian (which uses Cyrillic). And if you travel even farther east, you'll discover the logographic Han characters of Chinese, which were also adopted in Japan and Korea. (Interestingly enough, in Vietnam you'll come across the Latin alphabet again, a triumph of sorts for early missionaries!)

I live in one of the most ethnically diverse cities of the world--New York. Every day I witness this diversity in a potpourri of languages heard and seen on the streets. There are Ukrainian churches, Korean delicatessens, Chinese restaurants, Pakistani newsstands, and subway advertisements in languages I don't even recognize. And then I come home and use ASCII, a character-encoding system that is not only inadequate for the written languages of much of the world, but also for many people who live right in my own neighborhood.

We simply can't be so parochial as to foster a system as exclusive and limiting as ASCII. The personal computing revolution is quickly encompassing much of the world, and it's totally absurd that the dominant standard is based solely on English as spoken in the U.S. I can't pretend to be dispassionate on this subject. The character encoding used in our computers must truly reflect the diversity of the world's people and languages.
ONGOING ATTEMPTS

Of course, there have been some partial solutions to this problem. Because ASCII is a 7-bit code, and 8-bit bytes have become common in many systems, it is possible to extend ASCII with another 128 characters. The original IBM extended character set included some accented characters and a lowercase Greek alphabet (useful for mathematics notation), as well as some block- and line-drawing characters. Unfortunately, this extended character set did not include enough accented letters for all European languages that used the Latin alphabet, so alternative extended character sets were devised. These are called code pages, and they still exist in DOS and (in great profusion) in OS/2. OS/2 users and programs can switch among code pages and get a different mapping of 8-bit codes to characters. An OS/2 program can even select extended EBCDIC code pages for over ten different languages!

Microsoft didn't entirely abandon the IBM extended character set in the first and subsequent versions of Windows, but most Windows fonts were built around an alternative extended character set. Microsoft called this the "ANSI character set," but it was actually based on an ISO (International Standards Organization) standard. The ANSI character set abandons the block- and line-drawing characters to include more accented characters that are useful for European languages employing the Latin alphabet.

But what about non-Latin alphabets? Some font vendors devised solutions to rendering other alphabets (such as Hebrew) with fonts designed specifically for that purpose. With such fonts, ASCII codes normally corresponding to the Latin alphabet are mapped to characters in other alphabets. With either code pages or alternative fonts, the interpretation of 8-bit character codes is ambiguous because it depends upon the selected code page or font. And then there's the problem of communicating with the Macintosh, which uses a different extended character set than either the original IBM PC or Windows uses. Even when communicating over electronic mail in American English, I sometimes see odd characters in letters from my Mac-user friends.

Another response to the limitations of ASCII is the double-byte character set (DBCS). With DBCS, some characters require 1 byte and some require 2 bytes (indicated by an initial byte greater than hexadecimal 0x80). This system allows representing both the ASCII character set and a non-Latin alphabet. DBCS has problems of its own, though, such as multiple standards. Also, because DBCS characters are not of uniform length, programmed parsing becomes difficult. For example, you can't simply skip ahead 6 characters by skipping ahead 6 bytes. You have to look at each and every character to see if it's represented by 1 or 2 bytes.

UNICODE TO THE RESCUE

The basic problem is that the world's written languages simply cannot be represented by only 256 8-bit codes. The previous solutions have proven insufficient and awkward. What's the real solution? It doesn't take a genius to figure out that if 8 bits are inadequate, then 16 bits might be just fine. Congratulations! You've just invented Unicode!

Unicode is truly as simple as that: Rather than the confusion of multiple 256-character code mappings or double-byte character sets that have some 1-byte codes and some 2-byte codes, Unicode is a uniform 16-bit system, thus allowing the representation of 65,536 characters.
This is sufficient for all the most common characters and logographs in all the written languages of the world (including some math and symbol collections), with about half the code space left free. Sixteen-bit characters are often called wide characters. Wide characters do not necessarily have to be Unicode characters, although for our purposes I'll tend to use the terms Unicode and wide characters synonymously.

THE WINDOWS NT SUPPORT

Of course, dealing with character codes that are 16 bits in length rather than 8 is quite a foreign concept to many of us. We are so accustomed to identifying a character with 8 bits that it seems unnatural and impossible. What about our operating systems? What about our programming languages? What about our hardware and printers? It's really not as bad as it sounds, although obviously quite a few years will pass before Unicode replaces ASCII as the universal system of character coding. Still, some essential support is already falling into place.

You can write programs for Windows NT that continue to use the ASCII character set, or you can write programs that use Unicode. You can even mix the use of ASCII and wide characters in the same program. How does this work? Well, every function call in Windows NT that requires a character string as a parameter (and there are quite a few of them) has two different entry points in the operating system. For example, there is a TextOutA function (the ASCII version) and a TextOutW function (the wide-character version); depending on the definition of an identifier, the name TextOut is defined as one or the other of them. We'll see how this works in more detail in a future column.

The ANSI standard for C also has support for wide characters, and this support is included in Microsoft's C compiler for Windows NT. Rather than using the strlen() function to find the length of a string, for example, you can use wcslen() (which translates to "wide character string length"). Instead of using sprintf() to format a string for display, you can use swprintf().

What about displaying non-Latin characters on our video displays and printers? Well, TrueType also supports wide character sets, and while a TrueType font file containing all the Unicode characters might be somewhere in the region of 5 to 10 megabytes, that's not an inordinate size for representing characters of all the world's written languages.

THE REFERENCE BOOKS

Unicode is documented in two volumes compiled by The Unicode Consortium and published by Addison Wesley in 1991 called The Unicode Standard: Worldwide Character Encoding, Version 1.0. Because the books contain charts showing all the characters of Unicode, they are marvelous to explore, and I highly recommend them. These books reveal the richness and diversity of the world's written languages in a way that few other documents have. In addition, the books provide the rationale and details behind the development of Unicode.

You'll probably be pleased to know that the first 128 codes of Unicode are identical to ASCII, thus facilitating a conversion from ASCII to Unicode. (Just add another zero byte to each ASCII character.) The second 128 characters are called the Latin 1 character set, which is the same as the Windows character set except that hexadecimal codes 0x0080 through 0x009F are defined as control characters in Latin 1. Many blocks of non-Latin characters are also based on existing standards, also easing conversions. Codes 0x0100 through 0x01FF provide additional variations of the Latin character set.
The codes 0x0400 through 0x04FF are for Cyrillic. Armenian, Hebrew, and Arabic come next, and soon you'll encounter more esoteric languages such as Devanagari (used in classical Sanskrit and modern Hindi), Bengali, Gurmukhi, and Gujarati (all North Indian scripts). And on and on and on. The famous Zapf Dingbats character set uses codes 0x2700 through 0x27BF, and the Han ideographs begin at 0x4E00. These Han characters are used to represent whole words or concepts in Chinese, and they are also used in Japanese and Korean. Unicode contains over 20,000 Han characters, about a third of the entire code space.

WHAT UNICODE DOESN'T ADDRESS

Sorting words in English is made easier by the consecutive coding of letters in the ASCII character set. The coding of characters in Unicode does not imply any collation sequence, and if you think about it, it doesn't make much sense to pretend that we know how to alphabetize a collection of words using different alphabets such as Latin, Hebrew, and Bengali. Even sorting English words is not as straightforward as ASCII would imply, because alphabetizing words is usually case-insensitive. Thus, sorting always requires special consideration beyond the simple numeric sequence of character codes. And even with extended ASCII character sets, sorting gets more complex with the accented Latin letters of many European languages. But at least with Unicode we have a consistent encoding of accented letters, so a table could be created that allows reasonable sorting.

Another issue Unicode doesn't address is the use of similar alphabets and logographs in different countries. For example, there are no separate French, German, or Finnish character sets within Unicode. The written languages of these countries share many unaccented and accented Latin characters. The situation is similar for the Han logographs. Quite often, the same Han character represents something different depending on whether it's used in Chinese, Japanese, or Korean. Unicode makes no attempt to reflect these differences. If the character is in Unicode, then it can be used in any of those three languages, regardless of its meaning.

Another problem with international programming is that some written languages do not run from left to right on the printed page. Unicode specifies that character strings be stored in logical order, which is the order that someone would type the characters from a keyboard. Properly displaying such text is left up to an application, but the Unicode reference books contain some information on this issue.

CAN IT BE DONE?

Unicode is certainly an important step in the move towards truly international programming. The question of whether it can really replace ASCII as the standard for worldwide character coding is almost irrelevant. It simply must. In my next columns on this subject, I'll refrain from further proselytizing and will focus on the programming mechanics involved in using Unicode.

- ----------------------------------------------------------------------------
4. Unicode, Wide Characters, and C ..................................... 344
- ----------------------------------------------------------------------------

Forwarded by: Liu Jian
Source: PC Magazine, November 9, 1993
Written by: Petzold, Charles

People who write about computers in more general interest magazines often avoid using the word byte, instead describing storage capabilities in terms of characters. That's a pretty simple conversion, because one character of text requires 1 byte of storage. Right?

Wrong!
When dealing with ASCII character sets, the equivalence is certainly correct. But ASCII character sets (even when extended by another 128 codes) are unable to represent anything beyond the Latin alphabet and some accented letters used in European alphabets. Several makeshift solutions exist, such as non-Latin fonts and double-byte character sets (DBCS). In a DBCS, some characters require 1 byte and some require 2; those requiring 2 bytes are used for Far Eastern languages.

A far better solution for international computing is to replace ASCII with a uniform 2-byte character encoding. As I discussed in the last issue, the system that shows the most promise of becoming a standard is Unicode. Unicode was developed by a consortium of big names in the computer industry and is supported by Windows NT. Its first 128 codes are the same as ASCII, but it is capable of representing all the characters of all the written languages of the world. It may even come to pass someday that journalists who write about computers in general-interest magazines will have to adjust their convenient equivalence of bytes and characters. (Presumably they can divide by 2!)

When Unicode text is stored in memory or files, character strings are represented as a series of 16-bit values rather than as bytes. The first time I encountered this concept, I got the shivers. How on earth do you use your favorite programming language with Unicode? Luckily, other people have considered that problem, and support for "wide characters" (as they are called) is part of the ANSI C standard. That's what I'll examine in this issue.

EIGHT-BIT CHARACTERS

We all know how to store characters and character strings in our C programs. You simply use the char data type. But to facilitate an understanding of how C handles wide characters, let's first review normal character definition. The following statement defines and initializes a variable containing a single character:

    char c = 'A' ;

The variable c requires 1 byte of storage containing the value 65 (the hexadecimal value 0x41), which is the ASCII code for the letter A. You can define a pointer to a character string like so:

    char * p ;

Windows NT, being a 32-bit operating system, reserves 4 bytes of storage for the character pointer. You can also initialize a pointer to a character string:

    char * p = "Hello!" ;

In this case, the variable p requires 4 bytes of storage, and the character string is stored in static memory using 7 bytes of storage--the 6 bytes of the string plus a terminating zero. You can also define an array of characters, like this:

    char a[10] ;

In this case, the compiler reserves 10 bytes of storage for the array. If the array variable is global (outside any function), you can initialize it with

    char a[] = "Hello!" ;

If you define this array as a local variable to a function, it must be defined as a static variable, as follows:

    static char a[] = "Hello!" ;

In either case, the string is stored in static program memory with a zero appended at the end, thus requiring 7 bytes of storage.

LET'S GET WIDER

The char data type continues to be a single-byte value in C. To use 16-bit wide characters in a C program, you must include the WCHAR.H (wide character) header file in your program:

    #include <WCHAR.H>

This header file contains definitions of new data types, structures, and functions for using wide characters.
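[Editor's note: as a quick check of the storage figures reviewed above, here is a minimal sketch of my own, not a listing from the article. It assumes a 32-bit compiler such as the one in the Win32 SDK, where a pointer occupies 4 bytes, and it uses the lowercase header names that most compilers accept.]

    /* Minimal sketch (illustrative, not from the article): print the
       storage sizes quoted in the eight-bit review, and peek at the
       16-bit wide-character type provided by the header. */
    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>

    static char a[] = "Hello!";   /* 6 characters plus a terminating zero */

    int main(void)
    {
        char  c = 'A';            /* 1 byte containing 65 (0x41)          */
        char *p = a;              /* the pointer itself is 4 bytes on a
                                     32-bit system such as Windows NT     */

        printf("sizeof(c) = %u\n", (unsigned) sizeof(c));    /* 1 */
        printf("sizeof(p) = %u\n", (unsigned) sizeof(p));    /* 4 */
        printf("sizeof(a) = %u\n", (unsigned) sizeof(a));    /* 7 */
        printf("strlen(a) = %u\n", (unsigned) strlen(a));    /* 6 */
        printf("sizeof(wchar_t) = %u\n",
               (unsigned) sizeof(wchar_t));   /* 2 with Microsoft's compiler */
        return 0;
    }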
In particular, WCHAR.H defines the new data type wchar_t as

    typedef unsigned short wchar_t ;

Although the int data type has grown from 16 bits to 32 bits under the Windows NT C compiler, the short data type is still a 16-bit value. Thus, the wchar_t data type is the same as an unsigned short integer. To define a variable containing a single wide character, you use the following statement:

    wchar_t c = 'A' ;

The variable c is the two-byte value 0x0041, the Unicode representation of the letter A. (However, given the Intel protocol of storing least-significant bytes first, the bytes are actually stored in memory in the sequence 0x41, 0x00. Keep this fact in mind as we examine the output of a sample program shortly.)

You can also define and initialize a pointer to a wide character string:

    wchar_t * p = L"Hello!" ;

Notice the L (for long) directly preceding the first quotation mark. This indicates to the compiler that the string is to be stored with wide characters, that is, with every character occupying 2 bytes. The variable p requires 4 bytes of storage, as usual, but the character string requires 14 bytes--2 bytes for each character with two bytes of zeros at the end. Similarly, you can define an array of wide characters this way:

    wchar_t a[] = L"Hello!" ;

The string again requires 14 bytes of storage. Although it looks rather ugly and unnatural, that L preceding the first quotation mark is very important, and there must not be a space between the two symbols. Only with that L will the compiler know you want the string to be stored with 2 bytes per character. Later on, when we look at wide character strings in places other than variable definitions, you'll encounter the L preceding the first quotation mark again. Fortunately, the C compiler will often give you an error message if you forget to include the L.

LET'S TRY IT OUT

If the concept of wide character strings is new to you, you're definitely not alone, and you may be skeptical that this can really work. So let's try it out. The UNITEST1 program is shown in Figures 1 and 2. To compile and run this program, you'll need Windows NT 3.1 and the Microsoft Win32 Software Development Kit installed. You can compile and link the program using the command line

    NMAKE UNITEST1.MAK

Notice that the UNITEST1.C program includes the WCHAR.H header file at the beginning. The program defines two character strings (the text "Hello, world!"), one using the char data type and the other using wchar_t. The two variable names for these character strings are acString (the ASCII version) and wcString (the wide-character version). UNITEST1 then uses the printf function to display each string, determine its storage size, determine the number of characters using the strlen function, and then display the first five characters in both character and hexadecimal format.

The first thing you'll notice when compiling the program is a warning message reading (in part) "incompatible types - from 'unsigned short [14]' to 'const char *'". This message results from passing the wcString variable to the strlen function, which expects a pointer to a character string but gets instead a pointer to a string of short integers. It's only a warning message, so the compilation will continue and you can run the program. The results are shown in Figure 3.

The top half of the output looks fine, exactly as we expected. But what happened with the wide character string? First, printf simply displays the string as 'H'. Why is this?
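[Editor's note: Figures 1 through 3 are not reproduced in this plain-text forward. As a rough stand-in, a UNITEST1-style experiment looks something like the sketch below; the variable names follow the article's description, but the code itself is illustrative rather than the actual listing (note the lowercase header names), and the casts merely silence the warning the article mentions.]

    /* Sketch of a UNITEST1-style experiment (illustrative, not the
       article's listing): watch printf and strlen stumble over a
       wide-character string. */
    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>

    char    acString[] =  "Hello, world!";   /* 13 chars + zero = 14 bytes */
    wchar_t wcString[] = L"Hello, world!";   /* 13 chars + zero = 28 bytes */

    int main(void)
    {
        /* The 8-bit half behaves exactly as expected. */
        printf("%s\n", acString);
        printf("bytes: %u  chars: %u\n",
               (unsigned) sizeof(acString), (unsigned) strlen(acString));

        /* The wide half misbehaves: on an Intel machine 'H' is stored as
           the bytes 0x48 0x00, so byte-oriented code stops at the 0x00. */
        printf("%s\n", (char *) wcString);             /* prints just "H" */
        printf("bytes: %u  chars: %u\n",
               (unsigned) sizeof(wcString),            /* 28              */
               (unsigned) strlen((char *) wcString));  /* reports 1       */
        return 0;
    }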
Well, printf expected a string of single-byte characters terminated by a zero byte. The first character has a 16-bit hexadecimal representation of 0x0048. But these bytes are stored in memory in the sequence 0x48, 0x00. The printf function thus assumed that the string was only one character long. Similarly, the strlen function reported that the string was only a single character.

Everything else seems to work. In particular, the sizeof operator reported that the ASCII string required 14 bytes of storage, and the wide character string required 28 bytes of storage. Also, indexing the wchar_t array correctly retrieves each character of the string for printf to display.

This program clearly illustrates the differences between the C language itself and the runtime library functions. The compiler interprets the string L"Hello, world!" as a collection of 16-bit short integers and stores them in the wchar_t array. The compiler also handles the array indexing and the sizeof operator correctly. But the runtime library functions strlen and printf are added during link time. These functions expect strings comprised of single-byte characters. When confronted with wide character strings, they don't perform as we'd like.

THE WCHAR LIBRARY FUNCTIONS

The solution is alternate runtime library functions that accept wide character strings rather than single-byte character strings. Fortunately, such functions exist in Microsoft's 32-bit C compiler package, and they're all defined in WCHAR.H. For example, the wide-character version of the strlen function is called wcslen (wide character string length); the wide-character version of the printf function is called wprintf.

Let's put these two functions to use in the UNITEST2 program shown in Figures 4 and 5. Notice that in the second part of the UNITEST2 program, the strlen function has been replaced with wcslen, and all the printf functions have been replaced with wprintf (although only one of them gave us trouble in UNITEST1). The only other code change is that a capital L now precedes the formatting string in the wprintf functions. From personal experience, I guarantee you'll frequently forget to include the L when you first start working with wide character strings. When you use the wide character functions defined in WCHAR.H, every string you pass to them must be composed of wide characters.

The output of UNITEST2 is shown in Figure 6. This is what we want. Although the size of the wide character string is 28 bytes (the 13 wide characters plus the terminating 16-bit zero), the wcslen function reports 13 characters. Keep in mind that the character length of a string does not change when you move to wide characters--only the byte length changes. And as I explained earlier, a byte is not necessarily a character.

BUT WHAT'S THE POINT?

Of course, we haven't yet established any real benefit to using Unicode in these two programs. We're still displaying pure ASCII text in character mode. The character mode font in the U.S. version of Windows NT isn't capable of displaying the extra Unicode characters. If such characters appeared in a string, they'd simply be ignored upon display. (You can test this by inserting, for example, the character 0x0413 into the wcString array. This is the character code for a letter in the Cyrillic alphabet.)

Of course, where Unicode is most important is in graphical Windows NT programs. Indeed, the retail release of Windows NT is shipped with a TrueType font containing a small subset of the complete Unicode character set.
It's called Lucida Sans Unicode, and it includes additional accented Latin letters; the Greek, Cyrillic, and Hebrew alphabets; and a bunch of symbols. We'll be making good use of this font in future columns when we begin exploring the use of Unicode in graphical programs. For now, we're simply trying to nail down the mechanics of using wide characters in a C program with the C runtime library functions.

MAINTAINING A SINGLE SOURCE

There are, of course, certain disadvantages to using Unicode. First and foremost is that every string in your program will occupy twice as much space. In addition, you'll observe that the functions in the wide character runtime library are larger than the usual functions. For this reason, you might want to create two versions of a program--one for a U.S. market that works strictly with ASCII and another for an international market that uses Unicode.

The best solution would be to maintain a single source code file that you could compile for either ASCII or Unicode. That's a bit of a problem, though, because the runtime library functions have different names, you're defining characters differently, and then there's that nuisance of preceding the string literals with an L.

One answer is to use the TCHAR.H header file supplied with Microsoft's 32-bit C compiler. (My speculation is that the T of TCHAR stands for text.) This header file is not part of the ANSI standard, so every function and macro defined therein is preceded by an underscore. TCHAR.H provides a set of alternative names for the normal runtime library functions requiring string parameters; for example, _tprintf and _tcslen. If an identifier called _UNICODE is defined and the TCHAR.H header file is included in your program, then _tprintf is defined to be wprintf:

    #define _tprintf wprintf

If not, then _tprintf is defined to be printf:

    #define _tprintf printf

And so on. TCHAR.H also solves the problem of the two character data types with a new data type named TCHAR. If the _UNICODE identifier is defined, then TCHAR is wchar_t:

    typedef wchar_t TCHAR ;

Otherwise, TCHAR is simply a char:

    typedef char TCHAR ;

Now it's time to address that L problem. If the _UNICODE identifier is defined, then a macro called __T is defined like this:

    #define __T(x) L##x

That pair of number signs is called a "token paste" and causes the letter L to be prepended to the macro parameter. If the _UNICODE identifier is not defined, the __T macro is simply defined in the following way:

    #define __T(x) x

Regardless, two other macros are defined to be the same as __T:

    #define _T(x) __T(x)
    #define _TEXT(x) __T(x)

Which you use will depend on how concise or verbose you would like to be. Basically, you must define your string literals inside the _T or _TEXT macro in the following way:

    _TEXT ("Hello, world!") ;

This causes the string to be interpreted as composed of wide characters if the _UNICODE identifier is defined, and as 8-bit characters if not.

Let's test it out with a single source code module named UNITEST3.C. Figures 7 and 8 show two make files, one for creating the ASCII version of the program (UNITESTA) and the other for the Unicode version (UNITESTW). UNITESTA.MAK compiles UNITEST3.C, shown in Figure 9, to create an object module named UNITESTA.OBJ. (Note that the compile command line uses the -Fo option to give the object file a different name than the source code file.) The UNITESTA.OBJ file is linked to create UNITESTA.EXE.
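[Editor's note: Figures 7 through 9 are likewise not included in this forwarded text. A rough single-source sketch along the lines just described might look as follows; the file and variable names are my own, illustrative rather than the actual UNITEST3.C listing. Built as-is it behaves like the ASCII version; built with the _UNICODE identifier defined it uses wide characters throughout. With Microsoft's runtime, the %s in _tprintf refers to a TCHAR string in both builds.]

    /* Sketch of a single-source module in the style of UNITEST3.C
       (illustrative).  Compile normally for the ASCII version, or with
       -D_UNICODE for the wide-character version. */
    #include <stdio.h>
    #include <wchar.h>
    #include <tchar.h>

    TCHAR tcString[] = _TEXT("Hello, world!");

    int main(void)
    {
        _tprintf(_TEXT("%s\n"), tcString);
        _tprintf(_TEXT("bytes: %u  chars: %u\n"),
                 (unsigned) sizeof(tcString),    /* 14 ASCII, 28 Unicode */
                 (unsigned) _tcslen(tcString));  /* 13 either way        */
        return 0;
    }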
UNITESTW.MAK is similar, except that the compile line also uses the -D (define) option to define the identifier _UNICODE.

UNITEST3 displays only one set of output lines. The printf and wprintf functions have been replaced with _tprintf. The strlen and wcslen functions have been replaced with _tcslen. The definition of the character string now uses the TCHAR data type. All character strings are enclosed in the _TEXT macro. Note that the program includes both WCHAR.H and TCHAR.H. The output from UNITESTA.EXE and UNITESTW.EXE is identical except for the line that reports the number of bytes occupied by the string in memory.

WHAT HAVE WE LEARNED?

We've seen how to use both ASCII strings and Unicode strings in the same source code file, and how to have a single source code file that can be compiled for either ASCII or Unicode. Of course, what I've discussed in this column doesn't represent the full extent of converting an existing program to Unicode. You'll have to find any places in your code where you've previously assumed that the size of a character is a byte or where you access a binary buffer or file as if it were a collection of characters. (You can download the UNITEST program and source code file from PC MagNet's Programming Forum, archived as UNI.ZIP.)

In the next installment of this column, we'll examine how the Windows NT API provides methods for using ASCII and Unicode in the same program, or for creating a single source that can be compiled for either. The methods are a little different from those provided by the C compiler, C runtime library, and header files, but as you will see, the results are similar.

+--------------------------------------------------------------------------+
| Executive Editor: Sifeng Ma (U.S.A.)                                     |
+--------------------------------------------------------------------------+
| CINET-L (China's InterNET Tech Forum) is a non-public discussion list;   |
| however, CINET-EDITOR@CND.ORG welcomes contributions on networking in    |
| China. Some related discussions may be found on CHINANET@TAMVM1.TAMU.EDU.|
| To join the forum CHINANET@TAMVM1.TAMU.EDU (or CHINANET@TAMVM1.BITNET),  |
| send a mail to LISTSERV@TAMVM1.TAMU.EDU or LISTSERV@TAMVM1.BITNET        |
| (Note: NOT CHINANET@TAMVM1) with the FIRST LINE of the mail body as      |
| follows:                                                                 |
|     SUB CHINANET Your_First_Name Last_Name                               |
+--------------------------------------------------------------------------+

------- End of Forwarded Message