- From: John Cowan <cowan@ccil.org>
- Date: Fri, 25 Sep 2009 11:38:35 -0400
- To: Paul Pierce <prp@teleport.com>
- Cc: "Michael S. Cokus" <msc@mitre.org>, EXI Comments <public-exi-comments@w3.org>, "public-xml-core-wg@w3.org" <public-xml-core-wg@w3.org>
Paul Pierce scripsit:

> Is it possible that the languages that are inefficiently coded in
> UTF-8 work better in UTF-16? A lot of XML documents are coded in either
> UTF-8 or UTF-16, plus some heavily used programming languages use UTF
> string encoding natively. It would be very cool if EXI processors
> could move character data straight across. EXI could have a single
> bit to indicate either UTF-8 or UTF-16, corresponding to this common
> subset of the XML encoding declaration.

I think that's a bad idea for a number of reasons:

1) The more optionality there is in the system, the slower and more
complicated encoders and decoders have to be. In particular, an encoder
will have to be quite smart about choosing UTF-8 vs. UTF-16, given that
its input will probably have been normalized by the XML parser front-end
into one or the other. Even in decoding, options = conditional branches =
slow processing on modern CPUs.

2) UTF-16 by itself supports all modern scripts equally well;
unfortunately, it also supports them equally badly. The space cost of
using it is high, even in non-Latin texts, because most scripts use the
ASCII space, and essentially all use the various ASCII newline
conventions.

3) The current EXI design uses their convention for arbitrary-size
unsigned integers, which all EXI encoders and decoders must be able to
deal with. Having to support one or two alternative conventions for
encoding integers less than 2^21 leads to code bloat.

My original argument was that UTF-8 is as good as EXI in space terms.
It's not (a wide variety of scripts use 2 bytes per character in EXI,
3 bytes per character in UTF-8) and that's that.

-- 
John Cowan http://www.ccil.org/~cowan cowan@ccil.org
Be yourself. Especially do not feign a working knowledge of RDF where
no such knowledge exists. Neither be cynical about RELAX NG; for in
the face of all aridity and disenchantment in the world of markup,
James Clark is as perennial as the grass. --DeXiderata, Sean McGrath
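The space comparison in the message's closing paragraph (2 bytes per character in EXI versus 3 in UTF-8 for many scripts) can be checked with a rough sketch. It assumes that EXI writes each character as its Unicode code point in the variable-length unsigned integer form: 7 value bits per octet, least significant group first, with the high bit set on every octet except the last. The function names exi_uint_octets and per_char_cost are hypothetical, and the sketch counts only per-character octets, ignoring string length prefixes and any string-table reuse.

```python
def exi_uint_octets(value: int) -> bytes:
    """Encode a non-negative integer 7 bits per octet, low group first.

    Assumed layout: the high bit of an octet is 1 when more octets follow
    and 0 on the final octet.
    """
    if value < 0:
        raise ValueError("value must be non-negative")
    out = bytearray()
    while True:
        group = value & 0x7F      # low 7 bits of what remains
        value >>= 7
        if value:
            out.append(group | 0x80)  # continuation bit: more octets follow
        else:
            out.append(group)         # final octet, high bit clear
            return bytes(out)


def per_char_cost(text: str) -> tuple[int, int, int]:
    """Return (EXI-style, UTF-8, UTF-16) octet counts for the characters of text."""
    exi = sum(len(exi_uint_octets(ord(ch))) for ch in text)
    utf8 = len(text.encode("utf-8"))
    utf16 = len(text.encode("utf-16-be"))  # big-endian, no byte order mark
    return exi, utf8, utf16
```

Under these assumptions, code points from U+0800 through U+3FFF take 2 octets in the EXI-style form and 3 in UTF-8; for instance, per_char_cost("\u0928") (a Devanagari letter) gives (2, 3, 2). An ASCII space costs 1 octet in both the EXI-style form and UTF-8 but 2 in UTF-16, which is the cost point 2 refers to. Every code point up to U+10FFFF is below 2^21 and so fits in at most 3 octets.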
Received on Friday, 25 September 2009 15:39:15 UTC