RE: Issue on EXI value validation

Dear Youenn,
 
Thank you for providing feedback on the EXI specification. You are indeed
correct that the current EXI specification's use of the terms "schema-valid"
and "schema-invalid" implies some unfortunate and undesirable consequences.
As it turns out, these consequences were not intended, which is why the
section on restricted character sets defines rules for representing
schema-invalid values using a [schema-valid value] representation. 
 
Based on your feedback, we are updating the specification to use more
appropriate terminology. In particular, the terms [schema-valid value] and
[schema-invalid value] will become [schema-typed value] and [untyped value],
respectively. This, along with some other updates, will eliminate any
perceived requirement for schema validation. The specification will also
make it clear that [schema-typed value]s SHOULD be used when the value can
be represented using the associated EXI datatype. These updates will give
implementations the flexibility to use [untyped value]s even in cases where
[schema-typed value]s would have worked. 
 
I hope this helps to address your concerns. Thank you again for your
comments!
 
    Best wishes,
 
    John
 
AgileDelta, Inc.
 john.schneider@agiledelta.com
 http://www.agiledelta.com


  _____  

From: public-exi-request@w3.org [mailto:public-exi-request@w3.org] On Behalf
Of FABLET Youenn
Sent: Monday, August 24, 2009 8:07 AM
To: 'public-exi@w3.org'
Subject: Issue on EXI value validation



Dear all,

 

While reading the EXI specification, I came across the following issue (two
related sub-issues are also described below).

 

1.  Schema validation of XML values

 

As per the current specification, when a schema-informed grammar is in use,
the encoder needs to check the validity of all XML values in order to select
the correct production (section 8.5.4.4.1).

As I understand it, validity is computed according to the schema in use.

This places a real burden on EXI implementations, mainly in terms of
processing efficiency but also in terms of compression:

1)      The EXI encoder needs to keep all simple type information present in
the schema (increased in-memory schema representation)

2)      The EXI encoder needs to implement XSD 1.0 Part 2, notably regexp
validation (increased code footprint)

3)      The EXI encoder needs to perform Schema Part 2 validation on all XML
values (processing penalty, regexp evaluation for instance)

4)      Some invalid values could have been correctly encoded using the
built-in EXI datatype (compression penalty)

 

Some examples that illustrate these issues:

-          A string 'abc' would be encoded using a schema-invalid production
because its length is 3 and its simple type definition has a length facet of
4.

-          The float '1.0' would benefit from being encoded as a float even
if its simple type states that it must be in the range [0,1.0[ (1.0 excluded).

-          The integer '19' would be correctly encoded using a 5-bit integer
encoding even if its simple type states that the integer range is [0,18]
(see the short sketch after this list).

-          A string 'ofo' would benefit from being encoded according to the
restricted character set derived from the regexp {foo|oof}, although it is
an invalid value.
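
To make the 5-bit example concrete, here is a tiny Python sketch of the
arithmetic only (the helper name is mine, this is not EXI implementation
code): a range of 19 distinct values needs ceil(log2(19)) = 5 bits, and the
out-of-range value 19 also fits in those 5 bits.

    import math

    def nbit_width(lower, upper):
        # Bits needed for an integer bounded to [lower, upper]:
        # ceil(log2 of the number of distinct values).
        return math.ceil(math.log2(upper - lower + 1))

    width = nbit_width(0, 18)    # 19 distinct values -> 5 bits
    print(width)                 # 5
    print(19 < 2 ** width)       # True: 19 still fits in a 5-bit field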

 

I understand that schema validation is well defined in the XML Schema Part 2
specification and already widely deployed, but the purpose of EXI is to
achieve very good compression, not validation.

Would replacing MUST statements with SHOULD statements (stating that valid
values SHOULD be encoded with the schema-valid productions and invalid
values SHOULD be encoded with the schema-invalid productions) be sufficient?

Or is there perhaps a simple way to redefine the validation criterion in
terms of whether a specific codec can actually represent a given XML value
(schema-valid production) or not (schema-invalid production)?
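
To illustrate what I mean by a representability criterion, a minimal Python
sketch (the function names are hypothetical and not taken from any EXI
implementation; a real check would use the EXI Float lexical rules rather
than Python's float parser):

    # Choose the production from whether the built-in EXI datatype codec can
    # carry the value at all, instead of full Schema Part 2 facet checking.
    def representable_as_float(lexical):
        # Crude stand-in for "the EXI Float codec can represent this value".
        try:
            float(lexical)
            return True
        except ValueError:
            return False

    def choose_production(lexical, representable):
        # "typed" maps to the schema-valid production, "string" to the
        # schema-invalid (generic string) fallback.
        return "typed" if representable(lexical) else "string"

    print(choose_production("1.0", representable_as_float))  # typed, whatever the facet says
    print(choose_production("abc", representable_as_float))  # string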

 

2. Restricted charset behavior

 

Section 7.1.10.1 states that string characters not present in the restricted
charset may be encoded using a specific technique. 

According to section 8.5.4.4.1, restricted charset encoding will be limited
to valid values.
Schema-valid values will contain only characters in the restricted charset
and whitespace characters.

I am therefore wondering whether the ability to encode characters outside
the restricted charset is restricted to whitespace only, or whether its
purpose is broader.
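
For reference, here is how I read the mechanism of section 7.1.10.1, as a
rough Python sketch (the function name, the escape handling and the
code-point width are simplifications of mine, not spec text):

    import math

    def restricted_charset_encode(text, charset):
        # Characters in the set cost ceil(log2(N + 1)) bits each; a character
        # outside the set is signalled by the escape value N and then written
        # as a full code point (width simplified here).
        n_bits = math.ceil(math.log2(len(charset) + 1))
        escape = len(charset)
        out = []
        for ch in text:
            if ch in charset:
                out.append((charset.index(ch), n_bits))
            else:
                out.append((escape, n_bits))   # escape marker
                out.append((ord(ch), 21))      # simplified code-point encoding
        return out

    # Charset {'f', 'o'} (e.g. derived from the pattern foo|oof) -> 2 bits/char.
    print(restricted_charset_encode("ofo", ['f', 'o']))
    print(restricted_charset_encode("fo o", ['f', 'o']))  # the space uses the escape path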

 

3. Automatic selection of productions

 

The use of a restricted charset may be useful when compressing not only
valid values but also invalid ones.

But compression may suffer for invalid values that have few or no characters
in the charset (and we cannot always change the schema to improve the
situation).

While it can be difficult to draw a precise line in the specification, the
best decision can always be made by the encoder, which knows both the string
to encode and the restricted charset.

Would it be sensible to let the encoder decide which production to use: the
one that uses the built-in type codec or the one that uses the generic
string codec?
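
As an illustration of the kind of decision the encoder could make, a small
hypothetical heuristic (the threshold and the name are mine, nothing of this
is in the specification):

    def prefer_restricted_charset(value, restricted_charset, threshold=0.5):
        # Use the typed, restricted-charset production only when enough of
        # the value's characters are in the set; otherwise fall back to the
        # generic string production.
        if not value:
            return True
        in_set = sum(1 for ch in value if ch in restricted_charset)
        return in_set / len(value) >= threshold

    print(prefer_restricted_charset("ofo", {'f', 'o'}))          # True  -> typed production
    print(prefer_restricted_charset("hello world", {'f', 'o'}))  # False -> generic string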

 

This ability would also leave the door open for various optimization tricks.
For instance, if the same float value (typically "0") occurs 100 times in a
document, it may be more compact to encode it using the string codec and
benefit afterwards from string-table indexing than to encode the same value
100 times with the typed codec. 
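
A back-of-the-envelope version of that trade-off, with placeholder costs
(these numbers are assumptions for illustration, not measured EXI sizes):

    # A fixed per-occurrence typed encoding can lose to one string literal
    # plus cheap string-table hits when the same value repeats many times.
    TYPED_FLOAT_BITS = 16     # assumed cost of the typed float codec per occurrence
    FIRST_LITERAL_BITS = 24   # assumed cost of the first string literal
    TABLE_HIT_BITS = 7        # assumed cost of a string-table index

    occurrences = 100
    typed_total = occurrences * TYPED_FLOAT_BITS                            # 1600 bits
    string_total = FIRST_LITERAL_BITS + (occurrences - 1) * TABLE_HIT_BITS  #  717 bits
    print(typed_total, string_total)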

 

Although these use cases may sound somewhat marginal, adding this
flexibility would hurt neither interoperability nor the decoder's footprint.

What do you think?

 

Regards,

                youenn

 

Received on Wednesday, 16 September 2009 17:13:42 UTC