Re: "Request for response to original XML Core WG comments"

Cokus, Michael S. scripsit:

(Personal response, not approved by the XML Core WG)

> > 7) We believe that the current representation of strings has no
> > material advantage over UTF-8, since although it uses at most 3 bytes
> > per character, 4-byte UTF characters are very rare except in documents
> > written in obsolete scripts.
> 
> In our initial response we noted that a number of languages in common
> use are represented in UTF using 4 bytes.

Actually you said no such thing.  What you wrote was:

        E.g., there is a range of code points where EXI uses 2 bytes,
        versus 3 for UTF-8. Any content in such scripts would therefore
        be 50% larger in UTF-8 vs. current EXI. This would include the
        Devanagari scripts (used in several Indic languages, including
        Hindi), Thai, Hangul Jamo (but not Hangul syllables; Korea),
        Hiragana and Katakana (but not Kanji/CJK unified, Japan).

This argument is correct, and I didn't challenge it.  You then added:

        The EXI WG can't endorse the rarity claim, as these scripts appear
        to be in daily use by easily over one billion people with little
        observable tendencies to obsolete any of them.

The "rarity claim" was about scripts whose characters lie at U+10000 and
above, which require four bytes in UTF-8 but only three in EXI.  Your
examples all fall in the range U+0800 to U+3FFF, where characters require
three bytes in UTF-8 but only two in EXI, so they have no bearing on it.
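
To make the two ranges concrete, here is a rough Python sketch that counts
bytes per character both ways.  It assumes EXI stores each code point as a
variable-length unsigned integer carrying 7 bits per octet; the function
names and the sample characters are mine, chosen only for illustration.

```python
def utf8_len(cp: int) -> int:
    """Bytes needed to encode one code point in UTF-8."""
    return len(chr(cp).encode("utf-8"))


def exi_len(cp: int) -> int:
    """Bytes needed for one code point, ASSUMING a variable-length
    unsigned integer with 7 payload bits per octet."""
    n = 1
    while cp >= 0x80:
        cp >>= 7
        n += 1
    return n


# Illustrative sample points (my choice, not taken from either WG's text).
SAMPLES = [
    (0x0905, "Devanagari letter A"),
    (0x0E01, "Thai character ko kai"),
    (0x1100, "Hangul Jamo choseong kiyeok"),
    (0x3042, "Hiragana letter A"),
    (0x4E2D, "CJK unified ideograph"),
    (0xAC00, "Hangul syllable ga"),
    (0x10330, "Gothic letter ahsa"),
]

for cp, name in SAMPLES:
    print(f"U+{cp:04X}  UTF-8={utf8_len(cp)}  EXI={exi_len(cp)}  {name}")
```

Under that assumption the two-byte EXI form runs out at U+3FFF (14 bits),
which is why Kanji and Hangul syllables cost three bytes either way, and
the three-byte form (21 bits) covers everything up to U+10FFFF.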

In any case, I don't propose you do anything; this is just to correct
the record.

-- 
John Cowan   http://ccil.org/~cowan  cowan@ccil.org
[P]olice in many lands are now complaining that local arrestees are insisting
on having their Miranda rights read to them, just like perps in American TV
cop shows.  When it's explained to them that they are in a different country,
where those rights do not exist, they become outraged.  --Neal Stephenson

Received on Thursday, 24 September 2009 21:37:40 UTC