- From: Carl W. Brown <cbrown@xnetinc.com>
- Date: Fri, 28 Sep 2001 15:02:26 -0700
- To: <www-international@w3.org>
Merle,

> -----Original Message-----
> From: www-international-request@w3.org
> [mailto:www-international-request@w3.org]On Behalf Of Merle Tenney
> Sent: Friday, September 28, 2001 1:21 PM
> To: 'Carl W. Brown'; www-international@w3.org
> Subject: RE: utf-8 Locale support on Solaris and Linux
>
> Carl,
>
> > > 3. Most ICU interfaces do not take UTF-8 strings but rather UTF-16
> > > strings (which are 16 bits wide).
> >
> > ICU has some macros for UTF-8 support, but you have to look at them
> > carefully. They were added to ICU because they do not add to the code
> > size. They are not a complete UTF-8 support package. There are some
> > that I use, but others can get you into trouble. We just had a
> > discussion on the use of such macros to count the number of
> > characters in a string. There are two classes of support macros, SAFE
> > and UNSAFE. The SAFE macros validate the data; the UNSAFE macros,
> > which run faster, do not. Using either macro in a routine will
> > produce a bad count if the data is bad. The count may differ
> > depending on the choice, but neither will give you any indication
> > that the count is wrong.
> >
> > In my humble opinion, you are better off implementing your own
> > routines for many of these functions. They can be faster and more
> > reliable. This is the one area where I feel ICU would have been
> > better off just not trying to do a half-done job. In all other areas
> > the ICU code is top notch.
>
> This is good advice, but it leads naturally to another question: Why
> doesn't ICU have a branch that provides equivalent support to the
> existing code, but for text encoded in UTF-8? I know that you can
> convert easily between UTF-8 and UTF-16, but you really want to have a
> system that is designed, optimized, and tested for your native
> encoding.
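The counting pitfall above can be sketched as follows. This is a minimal illustration in C of my own devising (not ICU's actual U8_ macros and not xIUA code): a counting routine that validates as it goes and reports failure instead of silently returning a wrong count on malformed UTF-8.

```c
#include <stddef.h>

/* Hypothetical sketch, not ICU's API: count the code points in a UTF-8
 * buffer, returning the count, or -1 if the data is malformed, so the
 * caller is never handed a silently wrong count. Overlong forms and
 * surrogates are not checked here; this only shows the general shape. */
long u8_count_validated(const unsigned char *s, size_t len)
{
    long count = 0;
    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        size_t trail;
        if (b < 0x80)                trail = 0;   /* ASCII             */
        else if ((b & 0xE0) == 0xC0) trail = 1;   /* 2-byte sequence   */
        else if ((b & 0xF0) == 0xE0) trail = 2;   /* 3-byte sequence   */
        else if ((b & 0xF8) == 0xF0) trail = 3;   /* 4-byte sequence   */
        else return -1;                           /* bad lead byte     */
        if (i + trail >= len && trail > 0)
            return -1;                            /* truncated at end  */
        for (size_t k = 1; k <= trail; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return -1;                        /* bad trail byte    */
        i += trail + 1;
        count++;
    }
    return count;
}
```

An UNSAFE-style counter would skip the validation branches and simply step by the lead byte's implied length, which is exactly how a bad byte turns into a plausible-looking but wrong count.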
I believe that if you want to do something like collation, it is more efficient to convert the UTF-8 into UTF-16 or UTF-32 because of the intense character handling. The same applies to case shifting, but because of the double transformation and lighter character handling, I suspect it is probably closer to a draw. Other routines like strstr and strtok must have separate implementations for UTF-8, UTF-16 & UTF-32. I use a stateless fast transform that is very efficient with lots of little fields, so the overhead is minimal. It is also totally transparent to the user.

> There are a *lot* of Unicode implementations that will be based on
> UTF-8, so I don't think this is an unusual request. Has this been
> considered before?

I started on the original code that eventually became xIUA in Feb 2000. One of the first things I added was UTF-8 support for ICU.

> Would it take a lot of work to complement the existing ICU libraries
> with native UTF-8 versions and maintain them in parallel?

Seeing the growing interest in UTF-8 was one of the factors that led me to make xIUA available to the public as a cross-platform, free, open source addition to ICU. You can just use the UTF-8 support out of it, but I found that web applications were especially tricky in that they could use UTF-8 both as a Unicode encoding and as a code page. To make that easy for programmers, it needed a little extra infrastructure.

For example, a browser may use any of many character sets, from legacy code pages to UTF-8. I need an application that has transparent code page MBCS support and UTF-8 support. So I can use iso-8859-1 to browse a site, and if the text contains non-translatable characters such as the Euro character, they will be automatically inserted into the text as NCRs; the same code will see that if the browser is using a UTF-8 code page and the database is UTF-8, then it just has to copy the data.
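The NCR fallback described above might look roughly like this (the function name and interface are mine for illustration, not xIUA's actual API): code points that fit in iso-8859-1 are copied through, and anything outside the code page, such as the Euro sign U+20AC, is emitted as a numeric character reference so no data is lost. For simplicity the input here is an array of code points rather than UTF-8 bytes.

```c
#include <stdio.h>
#include <string.h>

/* Hedged sketch of the NCR fallback: convert Unicode code points to
 * Latin-1 output, substituting &#NNNN; for anything the code page
 * cannot represent. Returns the number of chars needed (sans NUL),
 * truncating the output if the buffer is too small. */
size_t to_latin1_with_ncr(const unsigned long *cps, size_t n,
                          char *out, size_t outsize)
{
    size_t pos = 0;
    for (size_t i = 0; i < n; i++) {
        if (cps[i] <= 0xFF) {            /* representable in Latin-1 */
            if (pos < outsize) out[pos] = (char)cps[i];
            pos++;
        } else {                         /* fall back to an NCR      */
            char buf[16];
            int len = snprintf(buf, sizeof buf, "&#%lu;", cps[i]);
            if (pos + (size_t)len <= outsize)
                memcpy(out + pos, buf, (size_t)len);
            pos += (size_t)len;
        }
    }
    if (pos < outsize) out[pos] = '\0';
    else if (outsize > 0) out[outsize - 1] = '\0';
    return pos;
}
```

When the browser and the database are both UTF-8, the same code path would reduce to a straight copy, which is the optimization described above.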
I also throw in code that will check the browser's Accept-Language header, a routine to convert it to a locale, and one to take the Accept-Charset and pick the best character set, so as to build a full locale for the browser. Code like this needs to be customized to the application; in fact, xIUA is designed to be integrated into the application.

Your next point was: how do you maintain this? In a year and a half I have only made minor changes. I tightened up the UTF-8 validation to meet the Unicode 3.1 changes, and I had to change the collation interface because of changes in ICU between 1.6 and 1.8. The other change I made was to the string null-termination handling, to be more compatible with the new ICU 2.0 standards. In a year and a half there were only a few lines of code that needed to be changed; the other changes were improvements. By running the xIUA test program with a new ICU release I can find out very quickly if I need to change anything and exactly what needs changing.

Carl
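The Accept-Language-to-locale conversion mentioned above could be sketched like this (an illustrative sketch only; the helper name is mine, not xIUA's): take the first entry of the HTTP header, e.g. "en-us,fr;q=0.8", and map it to a POSIX/ICU-style locale ID such as "en_US". A real implementation would also weigh the q-values and consult Accept-Charset.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: normalize the first Accept-Language entry into
 * a locale ID: lowercase language, '-' -> '_', uppercase region. */
void accept_language_to_locale(const char *accept, char *locale,
                               size_t size)
{
    size_t j = 0;
    int after_sep = 0;                    /* past the '-' separator?  */
    for (size_t i = 0; accept[i] && j + 1 < size; i++) {
        char c = accept[i];
        if (c == ',' || c == ';' || c == ' ')
            break;                        /* end of the first entry   */
        if (c == '-') {
            locale[j++] = '_';
            after_sep = 1;
            continue;
        }
        locale[j++] = after_sep ? (char)toupper((unsigned char)c)
                                : (char)tolower((unsigned char)c);
    }
    locale[j] = '\0';
}
```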
Received on Friday, 28 September 2001 18:02:31 UTC