- From: Carl W. Brown <cbrown@xnetinc.com>
- Date: Wed, 15 Aug 2001 14:10:24 -0700
- To: <www-international@w3.org>
Andrea,

> -----Original Message-----
> From: www-international-request@w3.org
> [mailto:www-international-request@w3.org]On Behalf Of A. Vine
> Sent: Wednesday, August 15, 2001 10:40 AM
> To: www-international@w3.org
> Subject: Re: Unicode <-> CJKV national encoding; supporting
> multi-lingual webcontent
>
> "Carl W. Brown" wrote:
> >
> > The only way to sanely implement a multi-lingual site is using Unicode. The
> > best support for Unicode is ICU. http://oss.software.ibm.com/icu/ If you
>
> or Java :-)

Java has its own set of problems (challenges). First, many people already have code written in C that they do not want to rewrite. It is not as easy to actually get to the underlying Unicode in Java. C code usually runs faster. Then there is the problem of JVM versions and conflicting support.

ICU started with the Java Unicode support and adapted it for C/C++ applications. http://oss.software.ibm.com/icu/ You will notice that many of the functions are very Javaesque. It is a great component library for Unicode.

What I have added is extra functions that don't really belong in ICU proper. xIUA, unlike ICU, is designed as a sample starting point for code that you develop as part of your application. http://www.xnetinc.com/xiua/

While it is designed for typical applications, it is especially useful for web server applications. For example, it adds support for a per-thread set of locales. You can have one locale for the browser, one for your HTML pages and one for your Unicode database. You can make calls to transform your data from your HTML charset, which may be EUC-JP, to your browser charset, which is Shift_JIS. The same code may convert the same page to UTF-8 for the next browser.

If you are parsing data, your code can call xiua_strtok and the same call will work for UTF-32, UTF-16, UTF-8 and code page data. Unlike the normal strtok, it is also thread safe.

It manages your locale information, including time zones, using Java style time zones.
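To illustrate the thread-safety point about strtok above: the standard strtok keeps its scan position in hidden static storage, so two threads (or two interleaved scans) trample each other. A minimal sketch of the usual fix, keeping the position in caller-owned state, is below. The names tok_state, tok_init and tok_next are hypothetical and are not the xIUA API. The sketch is byte-oriented, so it is only safe for ASCII delimiters in ASCII or UTF-8 data; it does not attempt the encoding-aware behaviour described for xiua_strtok.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical illustration, not the xIUA API: a re-entrant tokenizer
   that keeps its position in caller-owned state instead of a static. */
typedef struct {
    char *next;   /* where the next scan starts; NULL once exhausted */
} tok_state;

static void tok_init(tok_state *st, char *text)
{
    st->next = text;
}

/* Returns the next token, or NULL when there are no more tokens.
   The input buffer is modified in place, as with strtok. */
static char *tok_next(tok_state *st, const char *delims)
{
    char *start;
    char *end;

    if (st->next == NULL)
        return NULL;

    /* Skip any leading delimiters. */
    start = st->next + strspn(st->next, delims);
    if (*start == '\0') {
        st->next = NULL;
        return NULL;
    }

    /* Find the end of the token and terminate it. */
    end = start + strcspn(start, delims);
    if (*end == '\0') {
        st->next = NULL;          /* token ran to the end of the text */
    } else {
        *end = '\0';
        st->next = end + 1;
    }
    return start;
}
```

Because each scan owns its tok_state, two scans can be interleaved freely, which is exactly what breaks with plain strtok.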
It also has special web functions. For example, it will analyze a browser accept language string, including the q= quality selections, and return the first choice language based on the installed ICU locales. It will also analyze a path and return any RFC 3066 language subdirectory name that is found to match your ICU installed locales.

It also has special migration aids. For example, it has a routine that will convert a strftime date time format to an ICU format using the ICU values from its resource bundles. It also makes conversion easier because, like Java, you don't have to pass the locale to every function that may invoke ICU, so you don't have to change any existing APIs to convert to Unicode.

This code is really a starting point for users. It is designed to be customized by users.

It also has alternative functions. For example, I have xiua_strcoll that most C programmers can relate to, but since the first implementation was in a special version of PHP, I also have:

    int32_t                    /* 1 = TRUE, 0 = FALSE, -1 = LOGIC ERROR */
    xiua_Collate(char *str1,   /* string 1 */
                 char *option, /* option string contains both comparison test */
                               /* and optional collation strength parameters */
                               /* "==" "<=" ">=" "!=" "<" ">" are the */
                               /* comparison test values and "?" ":" "#" are */
                               /* the valid strength codes. "==?" is a test */
                               /* for equal primary strength. */
                               /* ? = Primary letters match no case or case */
                               /* e.g "Black-bird" ==? "blackbird" */
                               /* but what constitute separate letters may differ */
                               /* by locale e.g. Spanish ch ll */
                               /* Secondary case insensitive normalized with accents */
                               /* : = Tertiary above plus case sensitive */
                               /* # = Strict match */
                               /* spaces are ignored, non-standard conditions are */
                               /* supported "!<>" or "=" are the same as "==" */
                               /* "" or "!" however are illogical and are errors */
                 char *str2);  /* string 2 */

Because the result is a TRUE/FALSE, it is easy to embed the result into a more complex test or a regular expression.
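One way to read the option string described in the xiua_Collate comments is as a set of relation characters plus an optional strength code. The sketch below is only an assumption about how such a string could be parsed, not the xIUA implementation: the names collate_option, parse_collate_option and apply_collate_option are hypothetical, and the actual comparison is left to the caller, who passes in a three-way result (negative, zero, positive) from whatever collator is in use. Only the strength codes "?", ":" and "#" are handled.

```c
#include <stddef.h>

/* Hypothetical sketch, not the xIUA code. Each relation is one bit. */
enum { REL_LT = 1, REL_EQ = 2, REL_GT = 4 };

typedef struct {
    int  relations;  /* bit set of REL_* outcomes that make the test TRUE */
    char strength;   /* '?', ':' or '#'; 0 if no strength code was given */
} collate_option;

/* Returns 0 on success, -1 on a logic error such as "" or "!". */
static int parse_collate_option(const char *option, collate_option *out)
{
    int mask = 0;
    int negate = 0;
    char strength = 0;
    const char *p;

    for (p = option; *p != '\0'; ++p) {
        switch (*p) {
        case ' ': break;                     /* spaces are ignored */
        case '<': mask |= REL_LT; break;
        case '=': mask |= REL_EQ; break;     /* "=" and "==" are the same */
        case '>': mask |= REL_GT; break;
        case '!': negate = 1; break;
        case '?': case ':': case '#':
            if (strength != 0)
                return -1;                   /* two strength codes */
            strength = *p;
            break;
        default:
            return -1;                       /* unknown character */
        }
    }
    if (mask == 0)
        return -1;                           /* "" or "!": nothing to test */
    if (negate)
        mask = ~mask & (REL_LT | REL_EQ | REL_GT);  /* "!<>" becomes "==" */

    out->relations = mask;
    out->strength = strength;
    return 0;
}

/* cmp is a three-way comparison result from the caller's collator. */
static int apply_collate_option(const collate_option *opt, int cmp)
{
    int rel = cmp < 0 ? REL_LT : (cmp > 0 ? REL_GT : REL_EQ);
    return (opt->relations & rel) != 0;      /* 1 = TRUE, 0 = FALSE */
}
```

Treating the relations as a bit set is a design choice of this sketch: it makes the non-standard forms such as "!<>" fall out of the same rule as "!=" without special cases.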
It also has useful i18n functions such as xiua_strncpyEx. It works somewhat like strncpy except that it always adds a null to the end of the target string and only copies full characters. So if it is copying UTF-8 or Shift_JIS data you will always get full characters copied, even if it means that the target buffer is not quite full. The null it adds to the end is 4 bytes if it is UTF-32 data. To make it easier to use, it returns the data length copied.

>
> Also, if you're generating HTML forms, I recommend you take a look at:
>
> http://www.unicode.org/iuc/iuc17/papers.html

Good presentation.

One thing that my software does is help overcome some of the internationalization problems that you have with the Apache web server, and that you don't have with servers like iPlanet, while staying iPlanet compatible.

I think that adding language as a type is an administrative nightmare. www.mysite.com/dir/subdir/mypage.html.en is a bad idea. If nothing else, it creates problems keeping track of links. It is also a maintenance nightmare if I also have www.mysite.com/dir/subdir/mypage.html.jp, which is encoded in sjis, mixed in with pages in other encodings. It is easy to get your hands crossed.

>
> Look at the presentation under Session A2 (mine ;-) and it looks like the
> presentation under A3 might have some useful information (David Taieb's).
>
> Regards,
> Andrea
>

Carl
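The whole-character copy described for xiua_strncpyEx earlier in this message can be sketched for the UTF-8 case as follows. This is a minimal illustration under stated assumptions, not the xIUA routine: the name utf8_copy_whole and its signature are hypothetical, the input is assumed to be valid null-terminated UTF-8, and Shift_JIS or UTF-32 data would need different boundary logic.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch, not the xIUA API. Copies as much of src as fits
   in dst (dst_size bytes including the terminator), never splitting a
   UTF-8 sequence, and always null-terminates when dst_size > 0.
   Returns the number of bytes copied, excluding the terminator. */
static size_t utf8_copy_whole(char *dst, size_t dst_size, const char *src)
{
    size_t len;
    size_t n;

    if (dst_size == 0)
        return 0;

    len = strlen(src);
    n = len < dst_size - 1 ? len : dst_size - 1;

    if (n < len) {
        /* src[n] is the first byte that will NOT be copied. If it is a
           continuation byte (10xxxxxx), the cut falls inside a
           character, so back up to the start of that character. */
        while (n > 0 && ((unsigned char)src[n] & 0xC0) == 0x80)
            --n;
    }

    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

With a 6-byte buffer and a string of three 3-byte characters, only the first character (3 bytes) is copied: the buffer is left not quite full rather than ending in half a character.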
Received on Wednesday, 15 August 2001 17:10:58 UTC