- From: John Schneider <john.schneider@agiledelta.com>
- Date: Wed, 16 Sep 2009 10:12:53 -0700
- To: "'FABLET Youenn'" <Youenn.Fablet@crf.canon.fr>, <public-exi@w3.org>
- Message-ID: <CC667241B4BE4230A888D0ECC9A41FA1@jcsdell8600>
Dear Youenn,

Thank you for providing feedback on the EXI specification. You are indeed correct that the current EXI specification's use of the terms "schema-valid" and "schema-invalid" implies some unfortunate and undesirable consequences. As it turns out, these consequences were not intended, and this is the reason the section regarding restricted character sets defines rules for representing schema-invalid values using a [schema-valid value] representation.

Based on your feedback, we are updating the specification to use more appropriate terminology. In particular, the terms [schema-valid value] and [schema-invalid value] will become [schema-typed value] and [untyped value] respectively. This, along with some other updates, will eliminate any perceived requirements for schema-validation.

The specification will also make it clear that [schema-typed value]s SHOULD be used when the value can be represented using the associated EXI datatype. This, and other updates, will give implementations the flexibility to use [untyped value]s even in cases where [schema-typed value]s would have worked.

I hope this helps to address your concerns. Thank you again for your comments!

Best wishes,

John

AgileDelta, Inc.
john.schneider@agiledelta.com
http://www.agiledelta.com

_____

From: public-exi-request@w3.org [mailto:public-exi-request@w3.org] On Behalf Of FABLET Youenn
Sent: Monday, August 24, 2009 8:07 AM
To: 'public-exi@w3.org'
Subject: Issue on EXI value validation

Dear all,

Reading the EXI specification, I have the following issue (two additional related sub-issues are also described below).

1. Schema validation of XML values

As per the current specification, when a schema-informed grammar is in use, the encoder needs to check the validity of all XML values to retrieve the correct production (section 8.5.4.4.1). AIUI, the validity is computed according to the schema in use.
This adds a real burden on EXI implementations, mainly in terms of processing efficiency but also in terms of compression:

1) The EXI encoder needs to keep all simple type information present in the schema (increased in-memory schema representation)
2) The EXI encoder needs to implement XSD 1.0 Part 2, notably regexp validation (increased code footprint)
3) The EXI encoder needs to do Schema Part 2 validation on all XML values (processing penalty, regexps for instance)
4) Some invalid values can be correctly encoded using the built-in EXI type (compression penalty)

Some examples that illustrate some of the issues:

- A string 'abc' would be encoded using a schema-invalid production because its length is 3 and its simple type definition has a length facet of 4.
- The float '1.0' would benefit from being encoded as a float even if its simple type states that it must be in the range [0,1.0[
- The integer '19' would be correctly encoded using a 5-bit integer encoding even if its simple type states that the integer range is [0,18].
- A string 'ofo' would benefit from being encoded according to the regexp {foo|oof} although it is an invalid value

I understand that schema validation is well defined in the Schema Part 2 specification and already well deployed, but the purpose of EXI is to achieve very good compression, not validation.

Replacing MUST statements by SHOULD statements (to state that valid values SHOULD be encoded with the schema-valid productions and invalid values SHOULD be encoded with the schema-invalid productions) may be sufficient? Or maybe there is a simple way to redefine the validation criteria in terms of whether a specific codec can actually represent a given XML value (schema-valid production) or not (schema-invalid production)?

2. Restricted charset behavior

Section 7.1.10.1 states that string characters not present in the restricted charset may be encoded using a specific technique.
According to section 8.5.4.4.1, restricted charset encoding will be limited to valid values. Schema-valid values will contain only characters in the restricted charset and whitespace. I am therefore wondering whether the ability to encode characters not in the restricted charset is restricted to whitespace only or whether its purpose is larger.

3. Automatic selection of productions

The use of a restricted charset may be useful when compressing valid values but also invalid values. But compression may suffer from invalid values that have few or no characters in the charset (and we cannot always change the schema to improve the situation). While it can be difficult to draw a specific line in the specification, the best decision can always be made by the encoder, which has knowledge of both the string to encode and the restricted charset.

Would it be sensible to let the encoder make the decision to choose which production to use: the one that uses the built-in type codec or the one that uses the generic string codec?

This ability would also leave the door open for various optimization tricks. For instance, if the same float value ("0" typically) happens 100 times in a document, it may be more compact to encode it using the string codec and benefit afterwards from indexing than to encode the same value 100 times with the typed codec.

Although these use cases may sound somewhat marginal, adding this flexibility would hurt neither interoperability nor the decoder's footprint.

What do you think?

Regards,
youenn
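
A minimal sketch of the "can the codec represent this value" criterion raised in issue 1, using the integer example ('19' against the range [0,18]). It assumes the bounded integer is encoded as an n-bit offset from the lower bound, which matches the "5 bits" figure in the message; the function names are hypothetical and are not taken from the EXI specification.

```python
def bits_for_range(lo: int, hi: int) -> int:
    """Bits needed to encode an offset in [0, hi - lo] (assumed n-bit offset encoding)."""
    return (hi - lo).bit_length()


def fits_typed_codec(value: int, lo: int, hi: int) -> bool:
    """True if value is representable by the n-bit codec, whether or not it is schema-valid."""
    n = bits_for_range(lo, hi)
    offset = value - lo
    return 0 <= offset < (1 << n)


# The range [0,18] needs 5 bits, so the schema-invalid value 19 still fits,
# while 32 does not and would need the generic string codec.
print(bits_for_range(0, 18))
print(fits_typed_codec(19, 0, 18))
print(fits_typed_codec(32, 0, 18))
```

Under this criterion the encoder needs only the range bounds, not a full facet validator, to pick a production.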
Received on Wednesday, 16 September 2009 17:13:42 UTC