RE: Comments on EXI from Taki Kamiya on 2008-06-20 (public-exi@w3.org from June 2008)

From: Taki Kamiya <tkamiya@us.fujitsu.com>
Date: Fri, 20 Jun 2008 11:40:28 -0700
To: "'Philip Boutros'" <phil.boutros@oracle.com>, <public-exi@w3.org>
Cc: <jeff.mischkinsky@oracle.com>
Message-ID: <5F2E139ABB994BD187D73923DD274DA1@catarojp>

Hi Philip,

Thank you for your insightful comments on the specification,
and our apologies for the long-delayed response.

As you may have noticed from the editorial note in
the 3rd working draft, we will add support for a magic cookie
at the very front of EXI streams. We expect to publish an
update in a couple of weeks, so please stay tuned.

As for the use of schemaID and instance-schema matching
integrity, the spec is intentionally agnostic. The approach
follows from the presupposition that it is up to the use cases,
applications, or other specifications that leverage the EXI format
to define the syntax and semantics of the schemaID field.
There are cases where strings only a couple of characters long
would be used as schemaID, whereas URIs may be suited to some
other use cases. We also feel that it is out of the scope of the format
specification to define a single mechanism to assure the matching
of instances and schemas. Again, it's up to use cases to define
schema identity in connection with their own schemaID semantics.
Either metadata managed out of band or a [user defined] header
options field could be used to assure the level of schema identity
that each use case requires for integrity. We are going to make this
point clearer in the next update of the specification.

Thank you,

-taki


-----Original Message-----
From: public-exi-request@w3.org [mailto:public-exi-request@w3.org] On Behalf Of Philip Boutros
Sent: Thursday, January 17, 2008 9:00 AM
To: public-exi@w3.org
Cc: jeff.mischkinsky@oracle.com
Subject: Comments on EXI

> Hello all
> Just a few comments on EXI...
>  
> 1) Magic value and version number
> 
> I'd like to second Carl Eric Codère's (and others) request for a reasonably 
> unique magic value in the header. It is easy to view EXI as a solely "in the pipe" 
> format as the TAG would like us to. I believe however that if this format reaches 
> recommendation and is adopted by vendors it's very likely to be persisted to 
> storage media in the form of standalone files. Many of those files will, over 
> time, become disconnected from the application/context in which they were 
> stored and thereby lose any external metadata marking them as EXI. Ten years 
> from now applications that support multiple file types must have a reasonably 
> conclusive way to identify standalone EXI files outside of any identifying context.
> 
> There are also shorter term use cases where EXI files that are still tightly 
> bound to their applications may need to be identified by external processes. 
> For example, an indexing application that is not tightly coupled with the 
> application storing the EXI would still like to quickly identify and index 
> the EXI. Indexing appliances running against file shares, BLOB indexing in 
> databases, etc. are examples where this might apply.
> 
> As to the implementation, I don't believe it is necessary to use 8 bytes as 
> JPEG2000 and PNG do in order to catch CR/LF problems that should be a thing of 
> the past (I hope). 4 bytes would suffice. 30 extra bits seems a small price to 
> pay for the long term ability to identify this format.
> 
> In addition, the bitwise nature of EXI's current header and the slightly 
> complex way the version number is stored seems to me to be a very small size 
> saving at the cost of simplicity and clarity. Some developer out there will 
> either be lazy or get it wrong. Do we really want Windows 2020 or Linux 10 
> blowing up when the EXI version number goes from 15 to 16? 
> 
> If it were me...
> 
> 4 byte magic value = 0x211 'E' 'X' 'I' or whatever
> 2 byte version number (including high bit for 'preview')
> 1 byte flags (only 0x01 defined as presence of EXI options)
> 
> ....resulting in 6 extra header bytes but 1) is identifiable 2) is trivial to 
> read, write, generate by hand, etc. and 3) provides 7 extra bits just in case.
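
A minimal sketch of the fixed-width layout proposed above, to show how little code it takes to read and write. It assumes big-endian byte order and uses a placeholder for the first magic byte, since the proposal leaves the exact value open ("or whatever"). The names MAGIC, PREVIEW_BIT, FLAG_HAS_OPTIONS, pack_header and parse_header are hypothetical and do not come from any EXI draft.

```python
import struct

# Hypothetical 4-byte magic; the first byte is a placeholder, not a value from the proposal.
MAGIC = b"\x00EXI"
PREVIEW_BIT = 0x8000      # high bit of the 2-byte version marks a preview version
FLAG_HAS_OPTIONS = 0x01   # the only flag bit the proposal defines

# magic (4 bytes), version (2 bytes), flags (1 byte); big-endian is an assumption
HEADER = struct.Struct(">4sHB")


def pack_header(version, preview=False, has_options=False):
    """Serialize the 7-byte header."""
    if not 0 <= version < PREVIEW_BIT:
        raise ValueError("version must fit in 15 bits")
    word = version | (PREVIEW_BIT if preview else 0)
    flags = FLAG_HAS_OPTIONS if has_options else 0
    return HEADER.pack(MAGIC, word, flags)


def parse_header(data):
    """Return (version, preview, has_options); raise ValueError if not recognized."""
    if len(data) < HEADER.size:
        raise ValueError("too short to hold a header")
    magic, word, flags = HEADER.unpack_from(data)
    if magic != MAGIC:
        raise ValueError("magic value not found")
    return word & 0x7FFF, bool(word & PREVIEW_BIT), bool(flags & FLAG_HAS_OPTIONS)
```

With byte-aligned fields like these, a version change from 15 to 16 is only a different value in the same 2-byte field, so a reader has no special case to get wrong.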
> 
> 
> 2) Clarity on EXI Options options (not a typo)
> 
> The section on EXI Options might make it even more clear that the options are 
> an EXI Body only and MUST NOT contain EXI Options themselves. 
> 
> 
> 3) Is schemaID enough?
> 
> It worries me that a decoder has no way to guarantee that the schema it is 
> using to decode a schema-informed EXI is the exact one that was used to encode 
> the original XML. Even more worrying is the fact that a mismatch between the 
> encoding and decoding schemas can result in valid XML but incorrect information. 
> For example, let's say I have a schema-informed EXI encoded with a schema that 
> includes this simple type...
> 
>   <xs:simpleType name="Status">
>     <xs:restriction base="xs:NCName">
>       <xs:enumeration value="CLOSED" />
>       <xs:enumeration value="OPEN" />
>       <xs:enumeration value="REOPENED" />
>     </xs:restriction>
>   </xs:simpleType>
> 
> ....but the schema I am using to decode was updated (over time, uninformed 
> developer, etc.) to this...
> 
>   <xs:simpleType name="Status">
>     <xs:restriction base="xs:NCName">
>       <xs:enumeration value="CLOSED" />
>       <xs:enumeration value="FIXED" />
>       <xs:enumeration value="OPEN" />
>       <xs:enumeration value="REOPENED" />
>     </xs:restriction>
>   </xs:simpleType>
> 
> Using the updated schema to decode the EXI file will produce no errors or 
> warning of any sort but every item marked as OPEN in the original XML will 
> show up as FIXED. In addition, I believe that, according to the XML Schema 
> specification (please correct me if I'm wrong), this simple type...
> 
>   <xs:simpleType name="Status">
>     <xs:restriction base="xs:NCName">
>       <xs:enumeration value="REOPENED" />
>       <xs:enumeration value="CLOSED" />
>       <xs:enumeration value="OPEN" />
>     </xs:restriction>
>   </xs:simpleType>
> 
> ....is equivalent (in the XML Schema sense) to the original ("enumeration 
> does not impose an order relation on the value space it creates" XML Schema 
> Part 2: Datatypes Second Edition) thus bringing up the case that schemas that 
> are equivalent for XML may be different for EXI. Other examples of this 
> problem include element names, restricted character sets, bounded integers, etc.
> 
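
The mismatch described above can be reproduced with a small sketch. It assumes, as the example in the message implies, that a schema-informed encoder writes an enumerated value as its position in the schema's list of enumeration values. The helpers encode and decode are hypothetical stand-ins for a real EXI processor.

```python
# Enumeration values in schema document order, taken from the three examples above.
ORIGINAL = ["CLOSED", "OPEN", "REOPENED"]
UPDATED = ["CLOSED", "FIXED", "OPEN", "REOPENED"]
REORDERED = ["REOPENED", "CLOSED", "OPEN"]


def encode(value, schema_values):
    """Encode a value as its position in the schema's enumeration (assumed behavior)."""
    return schema_values.index(value)


def decode(index, schema_values):
    """Decode a position back into a value using the decoder's schema."""
    return schema_values[index]


for status in ORIGINAL:
    index = encode(status, ORIGINAL)
    print(status, "->", decode(index, UPDATED), "/", decode(index, REORDERED))
# CLOSED -> CLOSED / REOPENED
# OPEN -> FIXED / CLOSED
# REOPENED -> OPEN / OPEN
```

No step fails or warns. Every decoded value is a legal member of the decoder's enumeration, which is why the error goes unnoticed.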
> The integrity of the information then rests on some external guarantee that 
> the schemaID option (a single string) is sufficient to allow the decoder to 
> find the original schema across an arbitrary span of time and space. While 
> I recognize that some schemas, like http://www.w3.org/2001/XMLSchema, will 
> be stable effectively forever, I don't believe a URI match is sufficient 
> guarantee in the more general case. I'm very concerned that a more rigorous 
> method to guarantee a schema match is not part of this specification. 
> I haven't done enough thinking to recommend a complete solution but for XML 
> Schema, RelaxNG or any other XML based schema language I think a hash of the 
> EXI of the schema would most likely work. 
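
A rough sketch of the hash idea, assuming the encoder records a digest of its schema next to the schemaID (for instance as out-of-band metadata) and the decoder checks it before decoding. It hashes the raw schema bytes with SHA-256 as a stand-in. Hashing the EXI encoding of the schema, as suggested above, would additionally need an EXI encoder. The names schema_digest and check_schema are hypothetical.

```python
import hashlib


def schema_digest(schema_bytes):
    """Hex SHA-256 digest of the schema document as stored (raw bytes, not its EXI form)."""
    return hashlib.sha256(schema_bytes).hexdigest()


def check_schema(schema_bytes, expected_digest):
    """Raise ValueError unless the local schema matches the digest recorded at encode time."""
    actual = schema_digest(schema_bytes)
    if actual != expected_digest:
        raise ValueError("schema mismatch: refusing to decode")
```

A byte-level digest is deliberately conservative. It rejects the reordered enumeration, which is the desired outcome here, but it also rejects harmless changes such as whitespace edits.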
> 
> Thanks
> Philip J Boutros | Vice President, Software Development
> Oracle Development
> phil.boutros@oracle.com
> 
>
Received on Friday, 20 June 2008 18:42:59 UTC