
Re: "RE: "Request for response to original XML Core WG comments""

From: John Cowan <cowan@ccil.org>
Date: Fri, 25 Sep 2009 11:38:35 -0400
To: Paul Pierce <prp@teleport.com>
Cc: "Michael S. Cokus" <msc@mitre.org>, EXI Comments <public-exi-comments@w3.org>, "public-xml-core-wg@w3.org" <public-xml-core-wg@w3.org>
Message-ID: <20090925153835.GH11225@mercury.ccil.org>

Paul Pierce scripsit:

> Is it possible that the languages that are inefficiently coded in
> UTF-8 work better in UTF-16? A lot of XML documents are coded in either
> UTF-8 or UTF-16, plus some heavily used programming languages use UTF
> string encoding natively. It would be very cool if EXI processors
> could move character data straight across. EXI could have a single
> bit to indicate either UTF-8 or UTF-16, corresponding to this common
> subset of the XML encoding declaration.

I think that's a bad idea for a number of reasons:

1) The more optionality there is in the system, the slower and more
complicated encoders and decoders have to be.  In particular, an encoder
will have to be quite smart about choosing UTF-8 vs. UTF-16, given that
its input will probably have been normalized by the XML parser front-end
into one or the other.  Even in decoding, options = conditional branches =
slow processing on modern CPUs.

2) UTF-16 by itself supports all modern scripts equally well;
unfortunately, it also supports them equally badly.  The space cost of
using it is high, even in non-Latin texts, because most scripts use
the ASCII space, and essentially all use the various ASCII newline
conventions.

3) The current EXI design encodes each character using its own convention
for arbitrary-size unsigned integers, which all EXI encoders and decoders
must already be able to deal with.  Having to support one or two
alternative conventions for encoding integers less than 2^21 leads to
code bloat.
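
For readers unfamiliar with the convention mentioned in point 3, here is a
minimal sketch. It assumes the layout described in the EXI format
specification: 7 value bits per octet, least significant group first, with
the high bit set on every octet except the last. The function names
encode_uint and decode_uint are illustrative only, and the sketch assumes
byte-aligned output rather than EXI's bit-packed mode.

```python
def encode_uint(value: int) -> bytes:
    """Encode a non-negative integer as 7-bit groups, least significant first.

    Every octet except the last has its high bit set as a continuation flag.
    """
    if value < 0:
        raise ValueError("unsigned integers only")
    out = bytearray()
    while True:
        group = value & 0x7F      # low 7 bits of what remains
        value >>= 7
        if value:
            out.append(group | 0x80)  # more octets follow
        else:
            out.append(group)         # final octet: high bit clear
            return bytes(out)


def decode_uint(data: bytes) -> int:
    """Decode one integer from the start of `data` (inverse of encode_uint)."""
    value = 0
    shift = 0
    for octet in data:
        value |= (octet & 0x7F) << shift
        shift += 7
        if not octet & 0x80:      # high bit clear marks the last octet
            return value
    raise ValueError("truncated unsigned integer")
```

Under these assumptions every Unicode code point (all of which are below
2^21) fits in at most three octets, and the same routine handles every
other unsigned integer in the stream.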

My original argument was that UTF-8 is as good as EXI in space terms.
It's not (a wide variety of scripts use 2 bytes per character in EXI,
3 bytes per character in UTF-8) and that's that.
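
The size difference can be checked with a rough sketch like the following.
It assumes each character costs one octet per 7 bits of its code point (the
EXI unsigned-integer convention) and ignores EXI's string length prefix,
string tables, and compression. The helper names exi_char_octets and sizes
are hypothetical.

```python
def exi_char_octets(code_point: int) -> int:
    """Octets needed for one code point at 7 value bits per octet."""
    bits = max(1, code_point.bit_length())
    return (bits + 6) // 7


def sizes(text: str) -> dict:
    """Compare character-data sizes in octets for three encodings."""
    return {
        "exi": sum(exi_char_octets(ord(ch)) for ch in text),
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # no byte-order mark
    }


# Twelve Devanagari characters and one ASCII space.
hindi = "\u0928\u092e\u0938\u094d\u0924\u0947 \u0926\u0941\u0928\u093f\u092f\u093e"
# Plain ASCII text ending in a newline.
ascii_text = "hello world\n"

for label, sample in (("hindi", hindi), ("ascii", ascii_text)):
    print(label, sizes(sample))
```

Under these assumptions the Devanagari sample comes to 25 octets in EXI, 37
in UTF-8, and 26 in UTF-16, while the ASCII sample is 12 octets in EXI and
UTF-8 but 24 in UTF-16. Code points from U+0800 through U+3FFF are the
range where EXI needs 2 octets and UTF-8 needs 3.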

-- 
John Cowan      http://www.ccil.org/~cowan      cowan@ccil.org
Be yourself.  Especially do not feign a working knowledge of RDF where
no such knowledge exists.  Neither be cynical about RELAX NG; for in
the face of all aridity and disenchantment in the world of markup,
James Clark is as perennial as the grass.  --DeXiderata, Sean McGrath
Received on Friday, 25 September 2009 15:39:15 UTC
