"RE: "Request for response to original XML Core WG comments"" from Paul Pierce on 2009-09-24 (public-exi-comments@w3.org from September 2009)

From: Paul Pierce <prp@teleport.com>
Date: 24 Sep 2009 21:08:59
To: "Michael S. Cokus" <msc@mitre.org>, "EXI Comments" <public-exi-comments@w3.org>
Cc: "public-xml-core-wg@w3.org" <public-xml-core-wg@w3.org>
Message-ID: <paul.20090825t213517z.9@msi945g3/192.168.1.127>

Michael,

Thank you. I would like further discussion of the following two comments, in addition to the ongoing IEEE floating point discussion, where I look forward to seeing the relevant test results.


> > 7) We believe that the current representation of strings has no
> > material advantage over UTF-8, since although it uses at most 3 bytes
> > per character, 4-byte UTF characters are very rare except in documents
> > written in obsolete scripts.
> 
> In our initial response we noted that a number of languages in common use are
> represented in UTF using 4 bytes.  So we concluded that the EXI design (which
> uses 3 bytes) would result in significant savings in size.  To our knowledge,
> there were no further questions/responses concerning this comment.

Is it possible that the languages that are inefficiently coded in UTF-8 work better in UTF-16? A lot of XML documents are coded in either UTF-8 or UTF-16, plus some heavily used programming languages use UTF string encoding natively. It would be very cool if EXI processors could move character data straight across. EXI could have a single bit to indicate either UTF-8 or UTF-16, corresponding to this common subset of the XML encoding declaration.

If UTF-16 isn't good enough, is there another relatively simple way to import a subset of the XML encoding declaration into EXI in such a way that most characters can travel between EXI and XML or across APIs without translation?
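
To illustrate the size question, here is a minimal Python sketch that compares how many bytes a single character occupies in UTF-8 and in UTF-16. The sample characters and the function name encoded_sizes are chosen purely for illustration and are not taken from the EXI draft. The sketch says nothing about EXI's own string representation.

```python
def encoded_sizes(ch):
    """Return (utf8_bytes, utf16_bytes) for a single character.

    "utf-16-be" is used so that no byte order mark is added to the count.
    """
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


# Illustrative sample characters (an assumption, not from the thread).
samples = [
    ("U+0061 Latin small a", "\u0061"),
    ("U+03B1 Greek small alpha", "\u03b1"),
    ("U+4E2D CJK ideograph", "\u4e2d"),
    ("U+10330 Gothic letter ahsa", "\U00010330"),
]

for label, ch in samples:
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"{label}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

Characters in the range U+0800 to U+FFFF take 3 bytes in UTF-8 but 2 in UTF-16, while characters above U+FFFF take 4 bytes in both.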

> 
> 
> > 8) We are strongly concerned about the concept of pluggable
> > codecs as a barrier to interoperability, and believe that the
> > draft should contain a strong health warning about the use of
> > these: they should be used only in cases where there is explicit
> > agreement between the communicating parties, and never for
> > documents intended for consumption by a general audience.
> 
> We agree and said as much in our initial response.  A note has been placed
> in section 7.4 "Data Representation Map" to address this:
> 
> http://www.w3.org/TR/2008/WD-exi-20080919/#datatypeRepresentationMap
> 

I would very much like to see this pluggable codec/user datatype feature disappear altogether. It is already effectively present in schema and need not be duplicated in EXI. Leaving it to schema would make EXI more like XML and would be, I think, better design in having good separation of function.

So I guess I'm asking for a robust case for its existence, beyond the few cases already discussed (e.g. floating point) where the standard leans on user datatypes to support a standard representation in lieu of the default EXI-specific representation. What use cases require user datatypes and why can't they use schema? Are there other considerations?

Paul

Received on Thursday, 24 September 2009 21:46:43 UTC