RE: "Request for response to original XML Core WG comments" from Taki Kamiya on 2009-09-27 (public-xml-core-wg@w3.org from September 2009)

From: Taki Kamiya <tkamiya@us.fujitsu.com>
Date: Sat, 26 Sep 2009 20:17:56 -0700
To: "'John Cowan'" <cowan@ccil.org>, "'Cokus, Michael S.'" <msc@mitre.org>
Cc: <public-exi-comments@w3.org>, <public-xml-core-wg@w3.org>
Message-ID: <DBAECAE9CA834066BEFF91CBA53397F3@homunculus>

Hi John,

Thanks for setting the record straight.

To follow up on your response and keep it in perspective, I just wanted to
point out that more Kanji (Han characters used in Japanese writing) have
lately been added to the non-BMP planes, as the government and other
institutions make steady efforts to represent personal names and place
names accurately in Unicode. I am not attempting to refute the argument
that 4-byte UTF-8 characters are rare, but the ones I know of being added
are not obsolete, for what it's worth, wherever they are used.

Anyway, thanks for noticing it and making it accurate.

-taki

-----Original Message-----
From: public-exi-comments-request@w3.org [mailto:public-exi-comments-request@w3.org] On Behalf Of John Cowan
Sent: Thursday, September 24, 2009 2:37 PM
To: Cokus, Michael S.
Cc: Paul Pierce; EXI Comments; public-xml-core-wg@w3.org
Subject: Re: "Request for response to original XML Core WG comments"

Cokus, Michael S. scripsit:

(Personal response, not approved by the XML Core WG)

> > 7) We believe that the current representation of strings has no
> > material advantage over UTF-8, since although it uses at most 3 bytes
> > per character, 4-byte UTF characters are very rare except in documents
> > written in obsolete scripts.
>
> In our initial response we noted that a number of languages in common
> use are represented in UTF using 4 bytes.

Actually you said no such thing.  What you wrote was:

        E.g., there is a range of code points where EXI uses 2 bytes,
        versus 3 for UTF-8. Any content in such scripts would therefore
        be 50% larger in UTF-8 vs. current EXI. This would include the
        Devanagari scripts (used in several Indic languages, including
        Hindi), Thai, Hangul Jamo (but not Hangul syllables; Korea),
        Hiragana and Katakana (but not Kanji/CJK unified, Japan).

This argument is correct, and I didn't challenge it.  You then added:

        The EXI WG can't endorse the rarity claim, as these scripts appear
        to be in daily use by easily over one billion people with little
        observable tendencies to obsolete any of them.

The "rarity claim" was for scripts using characters from U+10000 up,
which require four bytes in UTF-8 but only three in EXI.  Your examples
were for the range U+0800 to U+3FFF, which require three bytes in UTF-8
but only two in EXI.
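
The byte counts for the two ranges above can be checked with a short
sketch. It assumes, consistent with the figures quoted in this thread,
that EXI writes each code point as a variable-length integer carrying
7 bits per byte; the function names and sample characters are
illustrative only and are not taken from either specification.

```python
def utf8_len(cp: int) -> int:
    """Bytes needed to encode one code point in UTF-8."""
    return len(chr(cp).encode("utf-8"))


def varint7_len(cp: int) -> int:
    """Bytes needed when a code point is written 7 bits per byte.

    Assumption: this models how EXI stores a character (one byte per
    7-bit group of the code point value).
    """
    n = 1
    while cp >= 0x80:
        cp >>= 7
        n += 1
    return n


# Illustrative sample code points, one per range discussed in the thread.
samples = [
    ("LATIN CAPITAL LETTER A", 0x0041),
    ("DEVANAGARI LETTER KA", 0x0915),
    ("THAI CHARACTER KO KAI", 0x0E01),
    ("HIRAGANA LETTER A", 0x3042),
    ("CJK unified ideograph (BMP)", 0x4E00),
    ("CJK Extension B (non-BMP)", 0x20000),
]

for name, cp in samples:
    print(f"U+{cp:04X} {name}: UTF-8 {utf8_len(cp)}, 7-bit groups {varint7_len(cp)}")
```

Under that assumption, U+0800 to U+3FFF comes out as three bytes versus
two, and everything from U+10000 up as four bytes versus three, while
BMP Han characters from U+4E00 take three bytes either way.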

In any case, I don't propose you do anything; this is just to correct
the record.

--
John Cowan   http://ccil.org/~cowan  cowan@ccil.org
[P]olice in many lands are now complaining that local arrestees are insisting
on having their Miranda rights read to them, just like perps in American TV
cop shows.  When it's explained to them that they are in a different country,
where those rights do not exist, they become outraged.  --Neal Stephenson

Received on Sunday, 27 September 2009 23:00:46 UTC