RE: [LC-2172] RE: EXI LC Comments from FABLET Youenn on 2009-01-08 (public-exi-comments@w3.org from January 2009)

From: FABLET Youenn <Youenn.Fablet@crf.canon.fr>
Date: Thu, 8 Jan 2009 10:22:59 +0100
To: Taki Kamiya <tkamiya@us.fujitsu.com>, "public-exi-comments@w3.org" <public-exi-comments@w3.org>
CC: "fujisawa.jun@canon.co.jp" <fujisawa.jun@canon.co.jp>, RUELLAN Herve <Herve.Ruellan@crf.canon.fr>
Message-ID: <C1797CB6A125334AB23C5A0A160944AD2C39E284AC@cressida.crf.canon.fr>
Dear Taki and WG,

thanks for studying this comment and for documenting your decision.
I am not strongly requesting the addition of this feature.
Please find some thoughts that came to mind after reading your answer.

First, the scenario I had in mind was mostly about constrained lists of items, e.g. small arrays/matrices.
The length facet could be of some interest there. I agree that the length facet for strings seems not that useful.

Second, in the current state of EXI, it would be better in terms of compression to break a fixed-length list of values into separate containers (one
attribute or element for each value). This information may be of interest to people writing schemas and aiming for very good schema-based EXI compression. In some cases, however, setting/getting values directly as an array is better suited for processing, hence a potential compression/efficiency tradeoff here.
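
To put a rough number on the second point: as far as I understand the List representation, a list is written as a length prefix (an Unsigned Integer, 7 bits per octet, least significant group first) followed by its items, whereas separate attributes or elements carry no such prefix. The Python sketch below only counts that prefix. The function names are mine, and the sketch ignores event codes and the encoding of the items themselves.

```python
def unsigned_integer_octets(value: int) -> bytes:
    """Encode a non-negative integer 7 bits per octet, least significant
    group first; the high bit of an octet is set when more octets follow."""
    if value < 0:
        raise ValueError("value must be non-negative")
    out = bytearray()
    while True:
        group = value & 0x7F
        value >>= 7
        if value:
            out.append(group | 0x80)  # continuation bit: more octets follow
        else:
            out.append(group)         # last octet: high bit clear
            return bytes(out)


def list_prefix_overhead(item_count: int) -> int:
    """Octets spent on the length prefix of a list with item_count items."""
    return len(unsigned_integer_octets(item_count))
```

For a nine-value list (say a 3x3 matrix) the prefix is a single octet, so a length facet would save at most one octet per occurrence of such a list.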

Third, I am not particularly convinced by the potential string encoding efficiency penalty, since facet checking needs to be done anyway before the actual encoding of strings or integers for instance. I am also confident that some nice tricks could be designed to favor the 'no-facet' usual case.

Fourth, the length facet seems not widely used today, maxLength maybe a little bit more.
I guess that the benefit of specifying it is not sufficient today.
If it were added to EXI, this could be an incentive for some to actually use that feature.

Regards,
        youenn


-----Original Message-----
From: Taki Kamiya [mailto:tkamiya@us.fujitsu.com]
Sent: Thursday, 18 December 2008 00:47
To: FABLET Youenn; public-exi-comments@w3.org
Cc: fujisawa.jun@canon.co.jp; RUELLAN Herve
Subject: [LC-2172] RE: EXI LC Comments

Hi Youenn,

This is in response to the item #1 of your set of comments.

The EXI encoding of simple type data, and the use and disuse of certain facets,
have been decided on the basis of both empirical observation of their merits and
their implications.

We looked at the length-related facets to see whether EXI should be aware of
them in order to improve the simple type value encodings. The results of
implementation experiments shared by WG members indicated that the effect of
leveraging those facets was not substantial enough to justify including the
function in the format specification, given that adding it would require EXI
processors to check for the presence of those facets on every occurrence
of a schema-bound string before determining how to represent the length
field. In this particular instance, the concerns about the potential processing
efficiency penalty outweighed the benefit observed. Encoding rules associated
with strings, such as string tables, are deliberately kept simple
because string handling is a known hot spot in processor execution, and any
subtle overhead can accrue into a noticeable cost in performance.

Also, please note that at least for repeated strings, string tables kick in after the
first occurrence of that string value. Therefore, the effect would be limited to the
first occurrence of a particular string value only. We presume that this is one of
the reasons why we did not see the level of benefit from the length-related
facets that we, like you, originally expected.
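
The string table argument can be illustrated with a toy model. The Python sketch below is a deliberate simplification: the class and function names are made up for this example, and it ignores the local and global partitions of the real value tables. It only shows that a length field is written on the first occurrence of each distinct value and never on later hits.

```python
class ValueStringTable:
    """Toy value table: the first occurrence is a miss, later ones are hits."""

    def __init__(self):
        self._index = {}

    def encode(self, value: str):
        if value in self._index:
            # Hit: only a compact identifier is written, no length field.
            return ("hit", self._index[value])
        # Miss: the literal is written with its length, then remembered.
        self._index[value] = len(self._index)
        return ("miss", len(value), value)


def count_length_fields(values) -> int:
    """Number of values whose encoding carries a length field."""
    table = ValueStringTable()
    return sum(1 for v in values if table.encode(v)[0] == "miss")
```

With the values "red", "green", "red", "red", "green", only two length fields are written, so a length facet could help on at most two of the five values.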

Thanks for all your comments!

Hope it helps,

-taki


________________________________

From: public-exi-comments-request@w3.org [mailto:public-exi-comments-request@w3.org] On Behalf Of FABLET Youenn
Sent: Thursday, November 06, 2008 8:16 AM
To: public-exi-comments@w3.org
Subject: EXI LC Comments



Dear EXI WG,

please find below some comments and questions regarding EXI specification last call working draft.

Regards,

                youenn fablet



1) Some facets are supported like minInclusive or maxExclusive.

What about supporting the length, minLength and maxLength facets, which could be useful to better encode string or list sizes?

It should not be too difficult to support them based on current facet support.

Is there a rationale to not include these facets?
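
To make the idea concrete, here is a purely hypothetical sketch, modelled on the way bounded integers are given a fixed number of bits when minInclusive and maxInclusive define a small range. Nothing like this is defined in the draft for lengths. The function name and the choice of a fixed-width field are assumptions made for illustration only.

```python
def bits_for_bounded_length(min_length: int, max_length: int) -> int:
    """Bits needed to encode (length - min_length) as a fixed-width field
    when the schema bounds the length to [min_length, max_length]."""
    if min_length < 0 or max_length < min_length:
        raise ValueError("expected 0 <= min_length <= max_length")
    # A range of n distinct lengths needs ceil(log2(n)) bits,
    # which equals (n - 1).bit_length().
    return (max_length - min_length).bit_length()
```

Under these assumptions a list with length="9" (minLength equal to maxLength) would need no length field at all, and a string with maxLength="15" would need 4 bits instead of a full octet.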



2) Guidelines for schema modeling

Is there any guideline regarding the relationship between EXI and schema modeling?

Guidelines would be useful to understand the impact of some schema modeling decisions on EXI encoding/decoding in terms of
efficiency and compression.

For instance, it seems that the more global constructs (elements, types, attributes) there are, the bigger the generated grammars will be,
since all global schema constructs need to be kept (right?).

Having a lot of xs:all or maxOccurs="999" may also hurt efficiency.

See also question 3)



3) DataTypeRepresentationType question

I would like a confirmation of the current DataTypeRepresentationType behaviour.
Let's have a schema with the following attribute definition:

                <xs:attribute name="test" type="xs:string"/>

In that case, the only way to change the encoding for @test values with the DataTypeRepresentationType feature

is to redefine xs:string which may have great impact.

If we only want to change the @test values with the DataTypeRepresentationType feature, we would need to

change the schema as follows:

                <xs:simpleType name="mystring">

                                 <xs:restriction base="xs:string"/>

                </xs:simpleType>

                <xs:attribute name="test" type="mystring"/>

DataTypeRepresentationType could then be used to redefine mystring.

Is it correct?

If so, the interoperability will generally be lost, since interoperable DataTypeRepresentationType use is currently limited to XML
Schema part 2 predefined types redefinition (end of section 7.4).

What about extending that behaviour to all simple types that have been gathered by consuming the schema in use?

Is there any rationale behind that specific constraint?



4)  Typed encoding in schema-less mode

EXI enables limited typed encoding support in schema-less encoding.
Since only predefined types are supported, xsi:type seems mainly useful to encode base64 chunks with the binary encoding.

Even in that case, the usability is not so good: in some cases, elements whose content is base64 also have attributes. For
instance ds:SignatureValue has an optional ID attribute.

Of course, one could still use xsi:type=base64Binary in deviation mode but interoperability may be pretty bad and putting a wrong
xsi:type for the purpose of compression seems broken.

Note also that:

                - Attribute values cannot be typed encoded with schema-less grammars.

                - Other useful types like "list of float","list of integers" cannot be used without external schema knowledge.

Improved out-of-the-box support of this use case would be very helpful.



5) EXI schema-less/schema-informed modes

Based on internal discussions and internal feedback, there is a general assumption that the EXI specification somehow defines two
separate modes (schema-less and schema-informed).

While it is clearly stated in the specification that both modes easily coexist in a single EXI stream,

additional advertisement (maybe in the primer) of that feature may be good for adoption.

The latest published primer (Dec 2007) could maybe be improved in that respect.



Additionally, while EXI provides great flexibility in the amount of schema put in grammars,

the schemaID mechanism seems very minimal.

It seems that interoperable uses of schema-informed EXI will greatly restrain the use of this flexibility.

Is there some additional work in that area that could or will be further conducted?



6) Is it conformant to not follow the attribute order in the case of a schema-informed grammar encoded element in deviation mode?

As stated in section 6, it seems not conformant.

In some cases, grammars can support attributes in no particular order, such as the example below (correct me if I got something
wrong).

<xs:complexType name="test">

                <xs:attribute name="name" type="xs:string"/>

                <xs:anyAttribute namespace="#any"/>

</xs:complexType>

<xs:element name="test" type="test"/>



While the benefit of ordering the attributes at the grammar level and the general compression benefit for encoders of following the
given order are obvious, I do not see a compelling reason for including this constraint in the format itself.

At the encoder side, the encoder may decide to order attributes or not.

If encoding fails due to bad ordering (in strict mode) or if the compression ratio is bad, the encoder can always decide to order
the attributes.
At the decoder side, the decoder is only following the grammars so it does not really care about the ordering.

There is even a drawback as this is one (major ?) difference between schema-informed and schema-less processing.

Am I missing something obvious?



7) RDF/XMP use case

This is more a general comment on specific XML/EXI use cases, notably RDF or XMP documents where

no standard, well defined XML schemas are available.

These documents generally have some defined structures and types (RDF schema, XMP schemas.) but no

well defined XML schemas.

What would be the recommendation from the WG to enable good interoperable EXI compression? Stick with schema-less encoding? Create an
XML schema, publish it and use it?



8)  Through careful checking of published EXI encoded streams

(Thanks again for the publication of these encoded examples by the way!),

Herve found some potential differences between the streams and the specifications (see below).



9)

Section 8.5.4.4.1:

  When adding production:

                                AT (qname) [schema-invalid value] Element?,?

to Elementi,j

Which next Symbol should be used?

Spec says Elementi,j

It would be more logical to use the symbol from the production:

                                AT (qname) [schema-valid value] Elementi,k



10)

Section 9.3

"Value channels that contain no more than 100 values" seems to mean: with *strictly* less than 100 values.

In this paragraph, all comparisons should be made clearer using 'greater or equal' and 'strictly greater'.
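
The two readings only disagree at the boundary, which the small Python sketch below makes explicit. The constant and function names are illustrative and are not taken from the specification.

```python
THRESHOLD = 100


def is_small_inclusive(value_count: int) -> bool:
    """Literal reading of "no more than 100 values"."""
    return value_count <= THRESHOLD


def is_small_strict(value_count: int) -> bool:
    """Alternative reading: strictly less than 100 values."""
    return value_count < THRESHOLD


def differing_counts(low: int, high: int) -> list:
    """Value counts in [low, high] on which the two readings disagree."""
    return [n for n in range(low, high + 1)
            if is_small_inclusive(n) != is_small_strict(n)]
```

A channel holding exactly 100 values is the only case in which an encoder and a decoder following different readings would lay out the stream differently.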



11)

Section 8.4.3

In Schema-less mode, EE productions should be promoted to event code 0 when used (if no EE production with an event code length of 1
already exists).



12)

Section 8.4.3

In Schema-less mode, when using the SE(*) production, should the creation of the SE(qname) production be done before the evaluation
of the element content?



In most cases, this has no impact. In case of recursive elements, this leads to better compaction.

Moreover, in case of recursive elements, the current specification seems to imply creating several SE(qname) productions.
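
The difference between the two orderings can be simulated with a toy model of grammar learning. The Python sketch below is an assumption-laden simplification: it keeps one list of learned SE(qname) productions per element qname and nothing else, and all names (Element, encode_content, simulate) are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    qname: str
    children: list = field(default_factory=list)


def encode_content(element, grammars, learn_before_content, stats):
    """Walk the children of element, learning SE(qname) productions."""
    productions = grammars.setdefault(element.qname, [])
    for child in element.children:
        if child.qname in productions:
            stats["se_qname"] += 1        # cheap, already learned
            encode_content(child, grammars, learn_before_content, stats)
        else:
            stats["se_wildcard"] += 1     # SE(*) plus a qname literal
            if learn_before_content:
                productions.append(child.qname)
            encode_content(child, grammars, learn_before_content, stats)
            if not learn_before_content:
                productions.append(child.qname)


def simulate(root, learn_before_content):
    """Return (event statistics, learned productions per qname)."""
    grammars = {}
    stats = {"se_qname": 0, "se_wildcard": 0}
    encode_content(root, grammars, learn_before_content, stats)
    return stats, grammars
```

For three nested "a" elements, learning before the content yields one SE(*) event and one SE(a) production, while learning after the content yields two SE(*) events and two duplicate SE(a) productions in the same grammar.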



13)

Section 8.4.3

xsi:schemaLocation attributes seem to be removed from the infoset before encoding in agile delta streams.

Is it by design or is it implementation related?



14)

Section 7.3.3

Empty strings can occur as attribute values.

Section 7.3.3 suggests that these empty strings are to be added in indexing tables.
The current literal EXI encoding being compact enough, it is reasonable not to add them in the table.
Received on Thursday, 8 January 2009 09:23:46 UTC