Subset variants: Comments on WD-xmlschema-1-19990924 from Rick Jelliffe on 1999-10-08 (www-xml-schema-comments@w3.org from October to December 1999)

From: Rick Jelliffe <ricko@gate.sinica.edu.tw>
Date: Fri, 8 Oct 1999 22:03:14 +0800
To: <www-xml-schema-comments@w3.org>
Message-ID: <003901bf1195$dd4be0c0$8d066d8c@sinica.edu.tw>
Variation is a fact of life for languages (e.g., structural schemas) as much
as anything else.

Taking HTML as a primary example, we find three official variants defined at
W3C, which in turn are abstractions of many variants implemented through
different generations of browsers. Similarly, a document going through a
workflow may have different levels of validation that are appropriate:
either because the structure changes through the workflow or because the
system designer does not want to know about particular validity errors at a
certain stage. It is also clear that a markup language evolves over time.

This needs to be made manageable and simple.

The current WD does not have UP-TO-DATE sections that address this issue;
the architectural refinement section is the closest.  It does not have a
concept of subset variants. I can easily imagine that other efforts (for
example, based on schema composition) may also miss this essential
characteristic unless it is explicitly addressed.

Subset variants can readily be handled using the following mechanism:
    * every element, content model, archetype reference, element reference
and attribute group reference, etc. should take an extra attribute "variant"
which contains a list of names.
    e.g.
        <element name="blink"  type="inline"
            variant="html-slack"  />

    * this attribute can be used in two ways.
        - a document instance can be validated against the schema in the
usual way; the validator can provide a list of which variants were found
during the parse.
        - a document instance can be validated against the schema allowing
only a provided set of variants;  the validator reports "schema-valid" or
"schema-invalid" only. This provides much stronger typing.

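The two modes above can be sketched as follows. This is only an
illustration in Python, not part of the proposal: the toy schema is a flat
list of element declarations with no content models, the function names
(load_variants, variants_found, is_schema_valid) are hypothetical, and it
assumes that an unmarked declaration is always allowed and that a marked one
is allowed when at least one of its variant names is in the provided set.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy schema: a flat list of element declarations. Content
# models are ignored; only the "variant" attribute matters in this sketch.
SCHEMA = """<schema>
  <element name="p" type="block"/>
  <element name="em" type="inline"/>
  <element name="blink" type="inline" variant="html-slack"/>
</schema>"""

def load_variants(schema_text):
    """Map each declared element name to its set of variant names."""
    root = ET.fromstring(schema_text)
    return {
        decl.get("name"): set(decl.get("variant", "").split())
        for decl in root.iter("element")
    }

def variants_found(schema_text, doc_text):
    """Mode 1: validate as usual and list the variants the document used."""
    declared = load_variants(schema_text)
    found = set()
    for node in ET.fromstring(doc_text).iter():
        if node.tag not in declared:
            raise ValueError(f"undeclared element: {node.tag}")
        found |= declared[node.tag]
    return sorted(found)

def is_schema_valid(schema_text, doc_text, allowed):
    """Mode 2: allow only the provided variants; answer valid or invalid."""
    declared = load_variants(schema_text)
    allowed = set(allowed)
    for node in ET.fromstring(doc_text).iter():
        if node.tag not in declared:
            return False
        names = declared[node.tag]
        # Assumption: unmarked structures are core; marked ones need at
        # least one of their variant names to be allowed.
        if names and not (names & allowed):
            return False
    return True

DOC = "<p>plain <em>stressed</em> <blink>flashing</blink></p>"
print(variants_found(SCHEMA, DOC))                   # ['html-slack']
print(is_schema_valid(SCHEMA, DOC, []))              # False
print(is_schema_valid(SCHEMA, DOC, ["html-slack"]))  # True
```
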
A variant attribute would also be useful for creating editing tools during a
workflow, to accommodate a division of labour: the operator might deem
metadata to be a "variant" and then validate the document against everything
except the metadata.
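
As a rough sketch of "everything except the metadata" (illustrative Python
only; the declaration table and the names allowed_except and permitted are
hypothetical, and it assumes a structure is permitted when it is unmarked or
belongs to at least one allowed variant), the allowed set can be computed by
exclusion:

```python
# Hypothetical declarations: element name -> variant names (empty = core).
DECLARED = {
    "article": set(),
    "para": set(),
    "author": {"metadata"},
    "keywords": {"metadata"},
    "blink": {"html-slack"},
}

def allowed_except(declared, excluded):
    """Every variant named in the schema, minus the excluded ones."""
    every = set().union(*declared.values())
    return every - set(excluded)

def permitted(declared, used_elements, allowed):
    """True if each used element is core or belongs to an allowed variant."""
    return all(
        not declared[name] or bool(declared[name] & allowed)
        for name in used_elements
    )

allowed = allowed_except(DECLARED, ["metadata"])
print(sorted(allowed))                                          # ['html-slack']
print(permitted(DECLARED, ["article", "para", "blink"], allowed))  # True
print(permitted(DECLARED, ["article", "author"], allowed))         # False
```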

A variant attribute therefore only makes sense on the roots of
non-required structures.

The advantages of this approach are, I believe:
    * convenient and obvious to compute
    * intuitive for schema-writers
    * avoids the problems of multiple-inheritance
    * does not require tracing through a chain of notionally
     (and perhaps physically) separate schemas to resolve
    * allows the specification of reduced content models; it seems that the
issue of how to extend existing schemas has been taking the WG's time rather
than the issue of how to subset the schema;
    * allows convenient description of HTML in one schema instead of 3 DTDs;
    * provides a mechanism for "modular HTML" as well.

I suspect that this approach may also simplify the issue of
schema extension: a "composition" or "inheritance" or
"refinement" system may more comfortably do its thing
for superset or piecemeal schemas.

(I suppose the alternative to this approach is to use
some kind of "exclusion" schema, in which a list
of exclusions is associated with some model. This
has all the disadvantages of being externally specified,
verbose, unintuitive, and poor modeling.)

(Note: it may be that, to make this proposal workable, the
variant names also need to be made first-class, with
declarations and a URL. This is a different issue.)

Theoretically, a "variant" schema is a subtype of the main schema, but not
declared using an inheritance mechanism.
Furthermore, because more than one variation can be
in operation at any time, a variation is perhaps better thought of as the
reification of a module where that module may have effects throughout the
schema.

It may be argued that the idea of "variants" is out of line with formal
computer language notions. I would note instead that grammar systems
modeling real-world phenomena often need exactly this kind of facility: I
note the presence of "guards" on transitions in UML statechart diagrams, and
the notion of phases in states which is used in some engineering modeling
(phase represents persistent data between invocations of a state).
Furthermore, as noted, the existence of variants as described above provides
no algorithmic challenges to an implementer or theoretical challenges for
schema composition.

I commend this to the Schema WG as an official comment on the current
working draft.

Rick Jelliffe
Computing Centre
Academia Sinica (W3C Member)
Taipei, Taiwan
Received on Friday, 8 October 1999 10:07:15 UTC