Comments on EXI from Philip Boutros on 2008-01-17 (public-exi@w3.org from January 2008)

From: Philip Boutros <phil.boutros@oracle.com>
Date: Thu, 17 Jan 2008 11:00:09 -0600
To: "public-exi@w3.org" <public-exi@w3.org>
CC: "jeff.mischkinsky@oracle.com" <jeff.mischkinsky@oracle.com>
Message-ID: <20080117110009046.00000003500@home-n71x9tee73>
Hello all
Just a few comments on EXI...
 
1) Magic value and version number

I'd like to second Carl Eric Codère's (and others') request for a reasonably unique magic value in the header. It is easy to view EXI as a solely "in the pipe" format, as the TAG would like us to. I believe, however, that if this format reaches Recommendation and is adopted by vendors, it is very likely to be persisted to storage media in the form of standalone files. Many of those files will, over time, become disconnected from the application/context in which they were stored and thereby lose any external metadata marking them as EXI. Ten years from now, applications that support multiple file types must have a reasonably conclusive way to identify standalone EXI files outside of any identifying context.

There are also shorter term use cases where EXI files that are still tightly bound to their applications may need to be identified by external processes. For example, an indexing application that is not tightly coupled with the application storing the EXI would still like to quickly identify and index the EXI. Indexing appliances running against file shares, BLOB indexing in databases, etc. are examples where this might apply.

As to the implementation, I don't believe it is necessary to use 8 bytes as JPEG2000 and PNG do in order to catch CR/LF problems that should be a thing of the past (I hope). 4 bytes would suffice. 30 extra bits seems a small price to pay for the long term ability to identify this format.
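To illustrate the kind of check an external indexer could make, here is a minimal Python sketch of signature sniffing. The PNG signature is the real 8-byte one; the EXI entry is purely a hypothetical 4-byte placeholder, since no magic value is defined in the draft under discussion, and the names `SIGNATURES` and `sniff` are made up for this example.

```python
# Known leading-byte signatures. PNG's is the real 8-byte signature;
# the EXI one is a HYPOTHETICAL 4-byte placeholder, not a defined value.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
HYPOTHETICAL_EXI_MAGIC = b"\x21EXI"

SIGNATURES = [
    ("png", PNG_MAGIC),
    ("exi", HYPOTHETICAL_EXI_MAGIC),
]


def sniff(data: bytes) -> str:
    """Guess a format name from the first bytes of a file or BLOB."""
    for name, magic in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown"
```

An indexer would only need to read the first few bytes of each file or BLOB and pass them to `sniff`, with no knowledge of the application that stored the data.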

In addition, the bitwise nature of EXI's current header and the slightly complex way the version number is stored seem to me to be a very small size saving at the cost of simplicity and clarity. Some developer out there will either be lazy or get it wrong. Do we really want Windows 2020 or Linux 10 blowing up when the EXI version number goes from 15 to 16?
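For readers who have not looked at the version field, the following Python sketch shows the scheme as I understand it from the EXI specification text: the version is stored as a sequence of 4-bit values, a value of 15 means "another 4-bit value follows", and the version is the sum of all the values plus 1. This is an assumption about the draft's wording rather than a quotation of it, the preview bit is left out, and the function names are invented for the example.

```python
def encode_version(version: int) -> list[int]:
    """Return the 4-bit values for a version number (assumed scheme)."""
    if version < 1:
        raise ValueError("version numbers start at 1")
    remaining = version - 1
    nibbles = []
    while remaining >= 15:
        nibbles.append(15)  # 15 signals that another 4-bit value follows
        remaining -= 15
    nibbles.append(remaining)  # a value in 0..14 terminates the sequence
    return nibbles


def decode_version(nibbles: list[int]) -> int:
    """Sum 4-bit values up to and including the first one below 15, plus 1."""
    total = 0
    for n in nibbles:
        if not 0 <= n <= 15:
            raise ValueError("not a 4-bit value")
        total += n
        if n < 15:
            return total + 1
    raise ValueError("version sequence is not terminated")
```

Under this reading, version 15 fits in a single 4-bit value but version 16 needs two, which is exactly the boundary where a decoder that only ever reads one value would go wrong.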

If it were me...

4 byte magic value = 0x211 'E' 'X' 'I' or whatever
2 byte version number (including high bit for 'preview')
1 byte flags (only 0x01 defined as presence of EXI options)

....resulting in 6 extra header bytes but 1) is identifiable 2) is trivial to read, write, generate by hand, etc. and 3) provides 7 extra bits just in case.
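A minimal Python sketch of reading and writing such a header follows. Several details are assumptions made only for illustration: the first magic byte is a placeholder (0x21), because the value written above does not fit in one byte and is explicitly left open; the fields are assumed to be big-endian; and all names are hypothetical.

```python
import struct

MAGIC = b"\x21EXI"            # placeholder magic; first byte is an assumption
PREVIEW_BIT = 0x8000          # high bit of the 2-byte version field
FLAG_OPTIONS_PRESENT = 0x01   # the only flag defined in the proposal

# Assumed layout: 4-byte magic, 2-byte version, 1-byte flags, big-endian.
_HEADER = struct.Struct(">4sHB")


def pack_header(version: int, preview: bool = False,
                options_present: bool = False) -> bytes:
    """Build the 7-byte header."""
    if not 0 <= version < PREVIEW_BIT:
        raise ValueError("version must fit in 15 bits")
    raw_version = version | (PREVIEW_BIT if preview else 0)
    flags = FLAG_OPTIONS_PRESENT if options_present else 0
    return _HEADER.pack(MAGIC, raw_version, flags)


def parse_header(data: bytes) -> dict:
    """Parse the 7-byte header, rejecting data without the magic value."""
    if len(data) < _HEADER.size:
        raise ValueError("too short to contain a header")
    magic, raw_version, flags = _HEADER.unpack_from(data)
    if magic != MAGIC:
        raise ValueError("magic value not found")
    return {
        "version": raw_version & (PREVIEW_BIT - 1),
        "preview": bool(raw_version & PREVIEW_BIT),
        "options_present": bool(flags & FLAG_OPTIONS_PRESENT),
    }
```

Every field sits on a byte boundary, so the header can be produced or checked with a hex editor, and a version change from 15 to 16 is just a different value in the same two bytes.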


2) Clarity on EXI Options options (not a typo)

The section on EXI Options might make it even more clear that the options are an EXI Body only and MUST NOT contain EXI Options themselves. 


3) Is schemaID enough?

It worries me that a decoder has no way to guarantee that the schema it is using to decode a schema-informed EXI is the exact one that was used to encode the original XML. Even more worrying is the fact that a mismatch between the encoding and decoding schemas can result in valid XML but incorrect information. For example, let's say I have a schema-informed EXI encoded with a schema that includes this simple type...

  <xs:simpleType name="Status">
    <xs:restriction base="xs:NCName">
      <xs:enumeration value="CLOSED" />
      <xs:enumeration value="OPEN" />
      <xs:enumeration value="REOPENED" />
    </xs:restriction>
  </xs:simpleType>

....but the schema I am using to decode was updated (over time, uninformed developer, etc.) to this...

  <xs:simpleType name="Status">
    <xs:restriction base="xs:NCName">
      <xs:enumeration value="CLOSED" />
      <xs:enumeration value="FIXED" />
      <xs:enumeration value="OPEN" />
      <xs:enumeration value="REOPENED" />
    </xs:restriction>
  </xs:simpleType>

Using the updated schema to decode the EXI file will produce no errors or warnings of any sort, but every item marked as OPEN in the original XML will show up as FIXED. In addition, I believe (please correct me if I'm wrong) that, according to the XML Schema specification, this simple type...

  <xs:simpleType name="Status">
    <xs:restriction base="xs:NCName">
      <xs:enumeration value="REOPENED" />
      <xs:enumeration value="CLOSED" />
      <xs:enumeration value="OPEN" />
    </xs:restriction>
  </xs:simpleType>

....is equivalent (in the XML Schema sense) to the original ("enumeration does not impose an order relation on the value space it creates" XML Schema Part 2: Datatypes Second Edition) thus bringing up the case that schemas that are equivalent for XML may be different for EXI. Other examples of this problem include element names, restricted character sets, bounded integers, etc.
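The effect can be reproduced with a few lines of Python. This is only a sketch of the idea, assuming that an enumerated value is written as its zero-based position in the schema's list of enumeration values; it does not model the bit-level encoding, and the function names are invented for the example.

```python
# The three versions of the "Status" enumeration, in schema order.
ORIGINAL = ["CLOSED", "OPEN", "REOPENED"]
UPDATED = ["CLOSED", "FIXED", "OPEN", "REOPENED"]
REORDERED = ["REOPENED", "CLOSED", "OPEN"]


def encode(value: str, enumeration: list[str]) -> int:
    """Encode a value as its position in the enumeration (assumed scheme)."""
    return enumeration.index(value)


def decode(index: int, enumeration: list[str]) -> str:
    """Decode a position back into a value using the decoder's schema."""
    return enumeration[index]


def round_trip(value: str, decoder_schema: list[str]) -> str:
    """Encode with the ORIGINAL schema, decode with another one."""
    return decode(encode(value, ORIGINAL), decoder_schema)


for status in ORIGINAL:
    print(status, "->", round_trip(status, UPDATED), "/",
          round_trip(status, REORDERED))
# CLOSED -> CLOSED / REOPENED
# OPEN -> FIXED / CLOSED
# REOPENED -> OPEN / OPEN
```

None of these mismatches raises an error: every index still lands on a legal value, so the decoded document remains schema-valid while its content is wrong.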

The integrity of the information then rests on some external guarantee that the schemaID option (a single string) is sufficient to allow the decoder to find the original schema across an arbitrary span of time and space. While I recognize that some schemas, like http://www.w3.org/2001/XMLSchema, will be stable effectively forever, I don't believe a URI match is a sufficient guarantee in the more general case. I'm very concerned that a more rigorous method to guarantee a schema match is not part of this specification. I haven't done enough thinking to recommend a complete solution, but for XML Schema, RelaxNG or any other XML-based schema language I think a hash of the EXI of the schema would most likely work.
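As a rough Python sketch of what such a check could look like: the encoder records a fingerprint of the schema, and the decoder refuses to proceed if its own copy produces a different one. The sketch hashes the raw schema bytes with SHA-256 simply because that is available in the standard library; hashing the EXI encoding of the schema, as suggested above, would need an EXI encoder and would be less sensitive to insignificant differences such as whitespace. The function names and the idea of carrying the fingerprint alongside schemaID are assumptions, not part of the specification.

```python
import hashlib


def schema_fingerprint(schema_bytes: bytes) -> str:
    """Return a hex SHA-256 digest of the schema's serialized form."""
    return hashlib.sha256(schema_bytes).hexdigest()


def check_schema(expected_fingerprint: str, schema_bytes: bytes) -> None:
    """Raise if the decoder's schema is not the one the encoder used."""
    actual = schema_fingerprint(schema_bytes)
    if actual != expected_fingerprint:
        raise ValueError(
            "schema mismatch: expected %s, got %s"
            % (expected_fingerprint, actual)
        )
```

With a check like this, adding a single enumeration value to the decoder's schema changes the fingerprint, so the decoder fails loudly instead of silently producing wrong values.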

Thanks
Philip J Boutros | Vice President, Software Development
Oracle Development
phil.boutros@oracle.com
Received on Thursday, 17 January 2008 19:21:51 UTC