RE: Substitution group handling from Taki Kamiya on 2009-11-03 (public-exi-comments@w3.org from November 2009)

From: Taki Kamiya <tkamiya@us.fujitsu.com>
Date: Tue, 3 Nov 2009 08:50:26 -0800
To: <antoine.mensch@odonata.fr>
Cc: <public-exi-comments@w3.org>
Message-ID: <F4354E2614C642968FE6A23332FB3843@homunculus>

Antoine,

Given a certain schemaID value, both the sender and the receiver of a
schema-informed EXI stream must be unequivocally in agreement as to both
the kind of schemas in use (XML Schema, Relax NG, etc.) and the exact set
of schema components (definitions of elements, attributes, types, etc.) for
use in processing EXI streams. This means that the schemaID must indicate
the totality of the schema components for use in processing EXI streams,
which may include more components than the ones that are defined by the
schema that contains the root element. The system that hosts EXI processors
may contain other schemas; however, the schema components therein will not
be used for encoding and decoding of EXI streams.

To achieve better efficiency, as much schema information as is needed to
accurately describe the elements and attributes expected in the instance
documents should be shared beforehand among the parties and included in the
totality of schema components indicated by a schemaID value. When EXI streams
contain elements and attributes not defined in that totality, they will be
processed as schema deviations.
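
As a rough illustration of this rule, here is a minimal Python sketch. The
registry, the schemaID string and the namespaces are all hypothetical, and
schema components are reduced to qualified element names; it only shows that
membership in the set named by the schemaID, not availability on the host,
decides how an element is processed.

```python
# Hypothetical registry: each schemaID names one fixed, complete set of
# schema components (reduced here to (namespace, local-name) pairs).
SCHEMA_SETS = {
    "urn:example:exchange-v1": frozenset({("urn:a", "a"), ("urn:b", "b")}),
}


def classify(schema_id, element):
    """Say how an element is processed under the set named by schema_id."""
    components = SCHEMA_SETS[schema_id]
    return "schema-informed" if element in components else "deviation"


# A schema for urn:c may well be installed on the host, but it is not part
# of the set named by the schemaID, so its elements count as deviations.
print(classify("urn:example:exchange-v1", ("urn:b", "b")))
print(classify("urn:example:exchange-v1", ("urn:c", "c")))
```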

Regards,

-taki

-----Original Message-----
From: Antoine Mensch [mailto:antoine.mensch@odonata.fr]
Sent: Thursday, October 15, 2009 1:43 AM
To: Taki Kamiya
Cc: public-exi-comments@w3.org
Subject: Re: Substitution group handling

Hi Taki,

Thanks for the clarification. However, I am still not convinced that
relying on the schemaID option to define the set of in-scope
namespaces/schemas for a document exchange is feasible in complex
distributed deployment scenarios, where servers and clients may use
different versions of application software and therefore different sets
of namespaces: while it seems relatively straightforward to ensure that
a client and a server have all necessary schemas available for a given
exchange, it is more than likely that they will have extra schemas
available, and that they will never know which ones should be considered
in scope for a given exchange unless this information is directly
provided by the EXI stream.

The set of in-scope schemas for a document exchange must be built from
two sources:

* The "static" set contains the schema of the root element of the
document, plus all schemas that are directly or indirectly imported by
this "root" schema. Exchanging the namespace of the "root" schema either
out-of-band or through the schemaID option seems a practical and
interoperable approach to determine the static set. Note that schemas of
substitution elements are usually not imported by the "root" schema, and
therefore cannot be automatically added to the static set.

* The "dynamic" set contains the schemas of namespaces not present in
the static set, that are introduced in the document through only four means:
1) Attributes matching the anyAttribute wildcard
2) Elements matching the any wildcard
3) Substitution elements
4) Deviations from the schemas
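
To make the two sets concrete, here is a minimal Python sketch. The import
graph and the namespace names are assumptions made up for illustration: the
static set is the transitive closure of imports starting from the "root"
namespace, and the dynamic set is whatever else shows up in the document.

```python
def static_set(root_ns, imports):
    """Namespaces reachable from root_ns by following imports transitively."""
    seen = set()
    stack = [root_ns]
    while stack:
        ns = stack.pop()
        if ns not in seen:
            seen.add(ns)
            stack.extend(imports.get(ns, ()))
    return seen


def dynamic_set(document_namespaces, static):
    """Namespaces used in the document that the static set does not cover."""
    return set(document_namespaces) - static


# Hypothetical import graph: B and C import A, but A imports nothing, so
# the schemas of substitution elements are not reachable from the root.
IMPORTS = {"urn:a": [], "urn:b": ["urn:a"], "urn:c": ["urn:a"]}

static = static_set("urn:a", IMPORTS)
dynamic = dynamic_set(["urn:a", "urn:b"], static)
print(sorted(static), sorted(dynamic))
```

With "urn:a" as the root, the namespace "urn:b" of a substitution element
lands in the dynamic set, which is the situation discussed next.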

The current EXI spec supports cases 1), 2) and 4) (when the strict
option is off) through the use of SE(*) or AT(*) events, which allow the
addition of the relevant namespaces to the dynamic set. However, in case
3), and only in that case, we have to rely on complex out-of-band
information to add appropriate schemas to the static set. I think this
is an inconsistency that may create interoperability problems in the
future. Note also that substitution elements are a relatively obscure
feature that is seldom used (at least in our experience), so relative
inefficiencies introduced at this level will have a limited impact on
the overall EXI performance.

If you consider that my previous proposal is too complex, what about
adding an option that would allow EXI encoders to use SE(*) for all
substitution elements, thus avoiding the interoperability problem
described above, at the expense of slightly less efficient compression
(this would basically consider all substitution elements as deviations
from the schema)? I actually think this should be the default behavior,
and that the mechanism in the current spec should only be used in cases
where the encoder is certain that it shares the same set of in-scope
schemas with the decoder for a particular document exchange.
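
A minimal Python sketch of the difference, reusing the a/b/c example from my
first message. The data model and the function names are hypothetical and
not taken from the spec; the sketch only compares the set S each processor
would derive from its known schemas with the set it would need if
substitution elements were always sent as SE(*).

```python
# Hypothetical model of the example: which schema declares each element,
# and which element each one names as its substitution group affiliation.
DECLARED_IN = {"a": "A", "b": "B", "c": "C"}
SUBSTITUTES_FOR = {"b": "a", "c": "a"}


def substitution_set(head, known_schemas):
    """Set S for head, built only from the schemas a processor knows."""
    members = {head}
    changed = True
    while changed:
        changed = False
        for elem, target in SUBSTITUTES_FOR.items():
            known = DECLARED_IN[elem] in known_schemas
            if known and target in members and elem not in members:
                members.add(elem)
                changed = True
    return members


def grammar_members(head, known_schemas, wildcard_for_substitutions):
    """Elements given their own SE event at the position of head."""
    if wildcard_for_substitutions:
        return {head}  # every other group member would be sent as SE(*)
    return substitution_set(head, known_schemas)


for known in ({"A", "B", "C"}, {"A", "B"}, {"A", "C"}):
    print(sorted(known),
          sorted(grammar_members("a", known, False)),
          sorted(grammar_members("a", known, True)))
```

Under the current mechanism the three processors derive three different
sets, whereas with the proposed option all of them agree on {"a"} whatever
extra schemas they happen to have.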

Cheers

Antoine


Taki Kamiya wrote:
> Hi Antoine,
>
> Thanks for the comment and your careful attention to the details of the spec.
>
> The EXI schema-informed grammar system is described in a way that is
> solely concerned with the abstract schema model which is agnostic about
> the physical schema composition (i.e. imports, includes and redefines)
> that is in the separate realm of the XML Schema specification.
>
> The schema information in effect for an individual EXI stream is either
> communicated out-of-band or through the schemaID option. This is described
> in section "5.4 EXI Options". However, your suggestion to make the correlation
> explicit is well taken, and we will add a sentence in "8.5 Schema-informed
> Grammars" to that effect with reference to that description.
>
> EXI does not try to leverage every feature of XML Schema exhaustively to
> wring every potential efficiency out of schemas. Instead, those schema
> features that EXI capitalizes on have been selected to achieve the best
> use of the schema. This is based on empirical judgement on the effect and
> broadness of the feature application while being keenly aware of the need to
> balance between the benefit of extra compactness and the accrued complexity
> that may adversely affect the code footprint and the processing efficiency.
>
> In the case of the abstract elements you brought to our attention,
> excluding them is expected to yield only the slightest improvement, if any,
> given the log_2(n) formula used in the Unsigned Integer representation.
> We hope this helps to explain why EXI does not take advantage of this XML
> Schema feature.
>
> Thanks!
>
> -taki
>
>
> -----Original Message-----
> From: Antoine Mensch
> Sent: Monday, September 28, 2009 12:55 AM
> To: public-exi-comments@w3.org
> Subject: Substitution group handling
>
>
>> The following definition (section 8.5.4.1.6) of the list of valid
>> members of an element declaration substitution group seems underspecified:
>>
>>     Let S be the set of element declarations that directly or indirectly
>>     reaches the element declaration PTi through the chain of
>>     {substitution group affiliation} property of the elements, plus PTi
>>     itself if it was not in the set.
>>
>>
>> The actual contents of S cannot be determined by only looking at the XML
>> Schema in which PTi is declared and the additional XML schemas it
>> imports. Rather, the complete set of XML Schemas in scope must be
>> considered to build S, as members of S can be contributed by each XML
>> Schema that imports the XML Schema in which PTi is declared.
>>
>> It is therefore important to determine the set of XML Schemas in scope
>> for a given EXI encoder/decoder, as shown in the example below:
>>
>> Let
>> - "a" be an element declaration in XML Schema A,
>> - "b" an element declaration in XML Schema B which has "a" as
>> {substitution group affiliation} property,
>> - "c" an element declaration in XML Schema C which has "a" as
>> {substitution group affiliation} property.
>>
>> Let P1, P2 and P3 be three EXI processors which respectively have {A, B,
>> C}, {A, B} and {A, C} as known XML Schemas.
>>
>> While in theory P1 and P2 could exchange schema-informed documents using
>> both A and B, P1 and P3 could exchange documents using both A and C, and
>> P2 and P3 could exchange documents using A, this will not be possible
>> unless a precise and shared definition of the set S for element
>> declaration "a" can be determined for each exchanged document. Indeed, a
>> naive static implementation would generate incompatible sets S1={"a",
>> "b", "c"}, S2={"a", "b"} and S3={"a", "c"} for
>> P1, P2 and P3.
>>
>> Is it the intention of the WG that this issue be addressed using the
>> SchemaId option? The current version of the spec leaves the use of this
>> option completely open in such cases, and that could lead to
>> interoperability issues. If it is nevertheless the case, it could at
>> least be useful to clarify in section 8.5.4.1.6 that S depends on the
>> SchemaId option.
>>
>> The WG could perhaps consider an alternative approach where members of
>> an element declaration substitution group are encoded as SE(*) the first
>> time their namespace appears in the document, and using the scheme
>> outlined in section 8.5.4.1.6 afterwards. This would allow both the
>> encoder and decoder to build the same set of in-scope namespaces for the
>> document, thus guaranteeing interoperability if both processors share
>> schemas for those namespaces. On the other hand, this would require the
>> dynamic construction of the set S for all elements that are potential
>> heads of substitution groups, thus deviating from the static approach
>> used so far for schema-informed grammars.
>>
>> Still about section 8.5.4.1.6, a minor optimization could probably be
>> obtained by excluding element declarations whose {abstract} property is
>> true from the set S, as such elements should never occur in valid documents.
>>
>> Best regards,
>>
>> Antoine Mensch
>>
>>
>
>
Received on Tuesday, 3 November 2009 16:51:18 UTC