RE: Substitution group handling

From: Taki Kamiya <tkamiya@us.fujitsu.com>
Date: Wed, 18 Nov 2009 11:17:40 -0800
To: <antoine.mensch@odonata.fr>, <public-exi-comments@w3.org>
Message-ID: <2EA6B0357C144A1381EC5951017D1E6D@homunculus>

If you want to use schemaID to indicate the schema components
defined in the schema of the root elements and their imports
and includes, you can do so. EXI requires both senders and
receivers to inform the EXI grammar system of the schema information 
from the exact same set of schema components before processing EXI 
streams. The premise of starting with the exact same grammars
at both ends is fundamental in guaranteeing interoperability.
Schema disparity violates that premise and thereby breaks interoperability.
For example, string tables are informed of the names and URIs
from schemas, and disparity would end up breaking the interoperable 
use of string tables. 
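
To make the string table point concrete, here is a minimal sketch. It
assumes a single table of namespace URIs whose compact IDs follow
sorted order; the real EXI string table has several partitions and
built-in initial entries, and the `prepopulate` helper and the
urn:example URIs are made up for illustration.

```python
def prepopulate(namespace_uris):
    """Give each URI a compact ID by sorted position (simplified model)."""
    return {uri: index for index, uri in enumerate(sorted(namespace_uris))}


# Hypothetical namespaces; the sender knows one schema the receiver lacks.
sender_table = prepopulate({"urn:example:a", "urn:example:b", "urn:example:c"})
receiver_table = prepopulate({"urn:example:a", "urn:example:c"})

# The sender writes the compact ID for urn:example:b ...
sent_id = sender_table["urn:example:b"]

# ... and the receiver resolves the same ID against its own table.
receiver_by_id = {index: uri for uri, index in receiver_table.items()}
resolved = receiver_by_id.get(sent_id)

print(sent_id, resolved)
```

In this model the receiver does not fail; it resolves the ID to a
different URI than the one the sender meant.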

Substitution groups can be modelled precisely: the set of
elements that can substitute for the group head element is a subset
of all the global elements defined in the schemas. That is not
the case with wildcard terms, where any global element
is naturally eligible to appear in place of the wildcard.

If every use of wildcard terms in schemas is expanded into productions,
each representing an element particle, the amount of memory used
to represent such productions tends to become intractable for some
devices. Substitution groups, on the other hand, are less likely to
place a significant burden on the memory footprint after being expanded
into a set of productions.
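
A back-of-the-envelope sketch of that difference follows. The counts
are invented and the two functions are hypothetical helpers; it only
assumes one production per candidate element at each place where a
wildcard or a group head is used.

```python
def wildcard_productions(wildcard_uses, global_elements):
    """Productions if every wildcard use lists every global element."""
    return wildcard_uses * global_elements


def substitution_productions(head_uses, group_sizes):
    """Productions if every use of a group head lists the group members.

    head_uses:   head name -> number of particles referring to that head
    group_sizes: head name -> number of members, head included
    """
    return sum(uses * group_sizes[head] for head, uses in head_uses.items())


# Invented figures: 500 global elements, 40 wildcard uses, two small groups.
wildcards = wildcard_productions(wildcard_uses=40, global_elements=500)
groups = substitution_productions(
    head_uses={"a": 10, "shape": 5},
    group_sizes={"a": 3, "shape": 4},
)

print(wildcards, groups)
```

The wildcard figure grows with the product of the two counts, while
the substitution group figure grows only with the sizes of the groups
actually declared.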

Hope this helps,


Hi Taki,

Thanks for your answer, but your mail does not really address my
concerns, which can be summarized as:

1) The schemaID value mechanism seems too complex: in order to ensure
interoperability, each exchange must be preceded by an out-of-band
agreement on the *set* of in-scope schemas to be used. It seems to me
that agreeing on the schema of the root element (and all the schemas
that it directly or indirectly imports) would be much simpler and
sufficient in most (all?) cases.

2) The handling of substitution groups is not consistent with the
handling of wildcards: if we assume that a set of schemas is in scope
for processing one document, then why not use the sorted list of global
elements of those schemas as a means to generate a restricted set of
event codes for wildcards (using the numbering scheme proposed for
substitution elements, and using SE(*) for elements in unknown schemas)?
Given the frequency of wildcards in schemas compared to the use of
substitution groups, I think this approach would in general have a
greater impact on the size of encoded documents than the substitution
group optimization. Note that I would still prefer that substitution
groups be handled like wildcards (because it would make interoperability
much easier to achieve), but at least it would be consistent.
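
To put rough numbers on point 2), here is a small sketch. It assumes
an event code part with m alternatives takes ceil(log2(m)) bits, and
`event_code_bits` is a hypothetical helper, not something defined by
the spec.

```python
def event_code_bits(alternatives):
    """Bits for one event code part with `alternatives` distinct values,
    i.e. ceil(log2(alternatives)), computed with integer arithmetic."""
    if alternatives < 1:
        raise ValueError("need at least one alternative")
    return (alternatives - 1).bit_length()


# Width of the code for a few illustrative numbers of alternatives.
for count in (1, 2, 3, 9, 10, 500):
    print(count, event_code_bits(count))
```

The width grows only logarithmically with the number of alternatives,
so a restricted code over a sorted list of global elements stays
short even for fairly large schema sets.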



> Antoine,
> Given a certain schemaID value, both the sender and the receiver of a
> schema-informed EXI stream must be unequivocally in agreement as to both
> the kind of schemas in use (XML Schema, Relax NG, etc.) and the exact set
> of schema components (definitions of elements, attributes, types, etc.) for
> use in processing EXI streams. This means that the schemaID must indicate
> the totality of the schema components for use in processing EXI streams,
> which may include more components than the ones that are defined by the
> schema that contains the root element. The system that hosts EXI processors
> may contain other schemas; however, the schema components therein will not
> be used for encoding and decoding of EXI streams.
> To achieve better efficiency, as much schema information as possible to the
> extent needed to accurately describe the elements and attributes expected to
> be used in the instance documents should be shared beforehand among parties
> and be included in the totality of schema components indicated by a schemaID
> value. When EXI streams contain elements and attributes not defined in the
> schema whole, they will be processed as schema deviations.
> Regards,
> -taki
> Thanks for the clarification. However, I am still not convinced that
> relying on the schemaID option to define the set of in-scope
> namespaces/schemas for a document exchange is feasible in complex
> distributed deployment scenarios, where servers and clients may use
> different versions of application software and therefore different sets
> of namespaces: while it seems relatively straightforward to ensure that
> a client and a server have all necessary schemas available for a given
> exchange, it is more than likely that they will have extra schemas
> available, and that they will never know which ones should be considered
> in scope for a given exchange unless this information is directly
> provided by the EXI stream.
> The set of in-scope schemas for a document exchange must be built from
> two sources:
> * The "static" set contains the schema of the root element of the
> document, plus all schemas that are directly or indirectly imported by
> this "root" schema. Exchanging the namespace of the "root" schema either
> out-of-band or through the schemaID option seems a practical and
> interoperable approach to determine the static set. Note that schemas of
> substitution elements are usually not imported by the "root" schema, and
> therefore cannot be automatically added to the static set.
> * The "dynamic" set contains the schemas of namespaces not present in
> the static set, that are introduced in the document through only four means:
> 1) Attributes matching the anyAttribute wildcard
> 2) Elements matching the any wildcard
> 3) Substitution elements
> 4) Deviations from the schemas
> The current EXI spec supports cases 1), 2) and 4) (when the strict
> option is off) through the use of SE(*) or AT(*) events, which allow the
> addition of the relevant namespaces to the dynamic set. However, in case
> 3) - and only in that case - , we have to rely on complex out-of-band
> information to add appropriate schemas to the static set. I think this
> is an inconsistency that may create interoperability problems in the
> future. Note also that substitution elements are a relatively obscure
> feature that is seldom used (at least in our experience), so relative
> inefficiencies introduced at this level will have a limited impact on
> the overall EXI performance.
> If you consider that my previous proposal is too complex, what about
> adding an option that would allow EXI encoders to use SE(*) for all
> substitution elements, thus avoiding the interoperability problem
> described above, at the expense of slightly less efficient compression
> (this would basically consider all substitution elements as deviations
> from the schema)? I actually think this should be the default behavior,
> and that the mechanism in the current spec should only be used in cases
> where the encoder is certain that it shares the same set of in-scope
> schemas with the decoder for a particular document exchange.
> Cheers
> Antoine
>> Hi Antoine,
>> Thanks for the comment and your careful attention to the details of the spec.
>> The EXI schema-informed grammar system is described in a way that is
>> solely concerned with the abstract schema model which is agnostic about
>> the physical schema composition (i.e. imports, includes and redefines)
>> that is in the separate realm of the XML Schema specification.
>> The schema information in effect for individual EXI stream is either
>> communicated out-of-band or through the schemaID option. This is described
>> in section "5.4 EXI Options". However, your suggestion to make the correlation
>> explicit is well taken, and we will add a sentence in "8.5 Schema-informed
>> Grammars" to that effect with reference to that description.
>> EXI does not try to leverage every feature of XML Schema exhaustively to
>> wring every potential efficiency out of schemas. Instead, those schema
>> features that EXI capitalizes on have been selected to achieve the best
>> use of the schema. This is based on empirical judgement on the effect and
>> broadness of the feature application while being keenly aware of the need to
>> balance between the benefit of extra compactness and the accrued complexity
>> that may adversely affect the code footprint and the processing efficiency.
>> In the case of the abstract element that you brought to our attention,
>> excluding it is expected to yield only the slightest improvement, if any,
>> given the log_2(n) formula used in the Unsigned Integer representation.
>> We hope this helps to explain why EXI does not take advantage of this XML
>> Schema feature.
>> Thanks!
>> -taki
>>> The following definition (section of the list of valid
>>> members of an element declaration substitution group seems underspecified:
>>>     Let S be the set of element declarations that directly or indirectly
>>>     reaches the element declaration PTi through the chain of
>>>     {substitution group affiliation} property of the elements, plus PTi
>>>     itself if it was not in the set.
>>> The actual contents of S cannot be determined by only looking at the XML
>>> Schema in which PTi is declared and the additional XML schemas it
>>> imports. Rather, the complete set of XML Schemas in scope must be
>>> considered to build S, as members of S can be contributed by each XML
>>> Schema that imports the XML Schema in which PTi is declared.
>>> It is therefore important to determine the set of XML Schemas in scope
>>> for a given EXI encoder/decoder, as shown in the example below:
>>> Let
>>> - "a" be an element declaration in XML Schema A,
>>> - "b" an element declaration in XML Schema B which has "a" as
>>> {substitution group affiliation} property,
>>> - "c" an element declaration in XML Schema C which has "a" as
>>> {substitution group affiliation} property.
>>> Let P1, P2 and P3 be three EXI processors which respectively have {A, B,
>>> C}, {A, B} and {A, C} as known XML Schemas.
>>> While in theory P1 and P2 could exchange schema-informed documents using
>>> both A and B, P1 and P3 could exchange documents using both A and C, and
>>> P2 and P3 could exchange documents using A, this will not be possible
>>> unless a precise and shared definition of the set S for element
>>> declaration "a" can be determined for each exchanged document. Indeed, a
>>> naive static implementation would generate incompatible sets S1={"a",
>>> "b", "c"}, S2={"a", "b"} and S3={"a", "c"} for
>>> P1, P2 and P3.
>>> Is it the intention of the WG that this issue be addressed using the
>>> SchemaId option? The current version of the spec leaves the use of this
>>> option completely open in such cases, and that could lead to
>>> interoperability issues. If it is nevertheless the case, it could at
>>> least be useful to clarify in section that S depends on the
>>> SchemaId option.
>>> The WG could perhaps consider an alternative approach where members of
>>> an element declaration substitution group are encoded as SE(*) the first
>>> time their namespace appears in the document, and using the scheme
>>> outlined in section afterwards. This would allow both the
>>> encoder and decoder to build the same set of in-scope namespaces for the
>>> document, thus guaranteeing interoperability if both processors share
>>> schemas for those namespaces. On the other hand, this would require the
>>> dynamic construction of the set S for all elements that are potential
>>> heads of substitution groups, thus deviating from the static approach
>>> used so far for schema-informed grammars.
>>> Still about section, a minor optimization could probably be
>>> obtained by excluding element declarations whose {abstract} property is
>>> true from the set S, as such elements should never occur in valid documents.
>>> Best regards,
>>> Antoine Mensch
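
The P1/P2/P3 example in the original message above can be sketched as
follows. The sketch assumes that "c" is declared in schema C (as the
processor sets imply), models each schema as a map from element name
to its {substitution group affiliation}, and numbers the members of S
by sorted name purely for illustration. `SCHEMAS` and
`substitution_set` are hypothetical names.

```python
# Element name -> substitution group affiliation (None if it has none).
SCHEMAS = {
    "A": {"a": None},
    "B": {"b": "a"},
    "C": {"c": "a"},
}


def substitution_set(head, known_schemas):
    """Elements reaching `head` through affiliation chains, plus `head`."""
    declarations = {}
    for schema in known_schemas:
        declarations.update(SCHEMAS[schema])
    members = {head}
    changed = True
    while changed:  # repeat until no new member is found (indirect chains)
        changed = False
        for name, affiliation in declarations.items():
            if affiliation in members and name not in members:
                members.add(name)
                changed = True
    return sorted(members)


s1 = substitution_set("a", ["A", "B", "C"])  # processor P1
s2 = substitution_set("a", ["A", "B"])       # processor P2
s3 = substitution_set("a", ["A", "C"])       # processor P3

print(s1, s2, s3)
print(s1.index("c"), s3.index("c"))
```

Under these assumptions "c" sits at a different position in the set
built by P1 than in the one built by P3, so a position-based code for
it would not be read back consistently between the two.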
