- From: Taki Kamiya <tkamiya@us.fujitsu.com>
- Date: Sat, 26 Sep 2009 20:17:56 -0700
- To: "'John Cowan'" <cowan@ccil.org>, "'Cokus, Michael S.'" <msc@mitre.org>
- Cc: <public-exi-comments@w3.org>, <public-xml-core-wg@w3.org>
Hi John,

Thanks for setting the record straight.

To follow up on your response and keep it in perspective, I just wanted to point out that more Kanji (Han characters used in Japanese scripts) have lately been added to the non-BMP planes, as the government and institutions make steady efforts to represent personal names and place names accurately in Unicode.

I am not attempting to refute the argument that 4-byte UTF-8 characters are rare, but the ones I know of that are being added are not obsolete, for what it's worth, in the places where they are used...

Anyway, thanks for noticing it and making the record accurate.

-taki

-----Original Message-----
From: public-exi-comments-request@w3.org [mailto:public-exi-comments-request@w3.org] On Behalf Of John Cowan
Sent: Thursday, September 24, 2009 2:37 PM
To: Cokus, Michael S.
Cc: Paul Pierce; EXI Comments; public-xml-core-wg@w3.org
Subject: Re: "Request for response to original XML Core WG comments"

Cokus, Michael S. scripsit:

(Personal response, not approved by the XML Core WG)

> > 7) We believe that the current representation of strings has no
> > material advantage over UTF-8, since although it uses at most 3 bytes
> > per character, 4-byte UTF characters are very rare except in documents
> > written in obsolete scripts.
>
> In our initial response we noted that a number of languages in common
> use are represented in UTF using 4 bytes.

Actually you said no such thing. What you wrote was:

    E.g., there is a range of code points where EXI uses 2 bytes,
    versus 3 for UTF-8. Any content in such scripts would therefore
    be 50% larger in UTF-8 vs. current EXI. This would include the
    Devanagari scripts (used in several Indic languages, including
    Hindi), Thai, Hangul Jamo (but not Hangul syllables; Korea),
    Hiragana and Katakana (but not Kanji/CJK unified, Japan).

This argument is correct, and I didn't challenge it. You then added:

    The EXI WG can't endorse the rarity claim, as these scripts
    appear to be in daily use by easily over one billion people
    with little observable tendencies to obsolete any of them.

The "rarity claim" was for scripts using characters from U+10000 up, which require four bytes in UTF-8 but only three in EXI. Your examples were for the range U+0800 to U+3FFF, which require three bytes in UTF-8 but only two in EXI.

In any case, I don't propose you do anything; this is just to correct the record.

--
John Cowan http://ccil.org/~cowan cowan@ccil.org
[P]olice in many lands are now complaining that local arrestees are insisting on having their Miranda rights read to them, just like perps in American TV cop shows. When it's explained to them that they are in a different country, where those rights do not exist, they become outraged.
        --Neal Stephenson
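
The byte counts discussed in this thread can be checked with a short sketch. It assumes that EXI stores each character as its Unicode code point in a variable-length unsigned integer carrying 7 payload bits per byte, which is consistent with the figures quoted in the thread. The function names `utf8_len` and `exi_len` and the sample characters are illustrative choices, not taken from the messages.

```python
def utf8_len(code_point: int) -> int:
    """Number of bytes UTF-8 uses for one code point (surrogates excluded)."""
    return len(chr(code_point).encode("utf-8"))


def exi_len(code_point: int) -> int:
    """Bytes needed under the ASSUMED EXI scheme: the code point written as
    an unsigned integer with 7 payload bits per byte."""
    length = 1
    while code_point >= 0x80:
        code_point >>= 7
        length += 1
    return length


# Sample code points, chosen here only for illustration.
SAMPLES = {
    "DEVANAGARI LETTER KA": 0x0915,
    "THAI CHARACTER KO KAI": 0x0E01,
    "HANGUL CHOSEONG KIYEOK (Jamo)": 0x1100,
    "HIRAGANA LETTER A": 0x3042,
    "CJK UNIFIED IDEOGRAPH-4E00": 0x4E00,
    "HANGUL SYLLABLE GA": 0xAC00,
    "CJK UNIFIED IDEOGRAPH-20000 (non-BMP)": 0x20000,
}

if __name__ == "__main__":
    for name, cp in SAMPLES.items():
        print(f"U+{cp:04X} {name}: UTF-8 {utf8_len(cp)}, EXI {exi_len(cp)}")
```

Under that assumption, the two-byte EXI range ends at U+3FFF (14 bits), so Devanagari, Thai, Hangul Jamo, and kana take 2 bytes against 3 in UTF-8, while CJK unified ideographs and Hangul syllables take 3 bytes in both. Only code points from U+10000 up show the 4-versus-3 difference that the rarity claim was about.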
Received on Sunday, 27 September 2009 03:19:04 UTC