A Schema Typing Proposal

While XPath 2.0 brings up issues around XML Schema, I think the
problem of sets of schemata is a general issue for any choice of
schema language and its use in XML Pipelines.

When I added schema processing steps to smallx, the first thing I needed
was a way to define a set of namespace to resource mappings.  You can't
rely on the xsi:schemaLocation et al. attributes, and some people,
myself included, think of those as huge hacks that we'd want to avoid.

I think the issue of typing comes down to three general problem
areas:

P1. Defining a set of compatible schemata that represent the "known
    universe" of elements, attributes, and types.  Overlapping sets
    need to be defined so that different kinds of validation can
    be performed for the same namespace names.

P2. Use of the sets defined in (P1) within a component.

P3. Inter-step typing and type comparison.

We can start with a simple assumption that there is some kind of
infoset annotation (e.g. a PSVI) that can be passed between
components that holds the additional type information.  This could
manifest itself as an XPath 2.0 data model instance or some other
infoset-based API.  This somewhat addresses (P3).

The remaining problem with (P3) is how we can be assured that
a particular type name maps to the same type definition.  That is,
if two different steps in the pipeline use schemata that use
the same names but different types or type definitions, what happens
to an XPath in the pipeline that uses that type?

I think there are a couple of simple notions that will help
us here:

    N1. Notional equivalence by type name.

    N2. Previous steps that produce the annotated infoset are
        determined by the pipeline author.

There are two places where we need to be concerned about typing [1]:

   * matching by type (i.e. "instance of" expressions)
   * comparing simple typed values

In the former case, "instance of" identifies the type by QName.  This
is where (N1) can help us.  We could say that, regardless of type
identity, if the name is the same it is the same type.
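
As a rough sketch of what (N1) would mean for an implementation, the
Python below compares annotations by type name only.  The QName and
TypeAnnotation classes, the schema_source field, the AddressType name
and the two .xsd file names are all made up for illustration; none of
them come from smallx or from XPath 2.0.

```python
from typing import NamedTuple, Optional


class QName(NamedTuple):
    namespace: Optional[str]  # None stands for "no namespace"
    local: str


class TypeAnnotation(NamedTuple):
    name: QName          # the type name carried in the annotated infoset
    schema_source: str   # hypothetical: which schema document defined it


def instance_of_by_name(annotation: TypeAnnotation, wanted: QName) -> bool:
    """(N1): match on the type name only, ignoring where it was defined."""
    return annotation.name == wanted


# Two steps validated with different (hypothetical) schemata for the
# same namespace, both producing a type with the same name.
ADDRESS = QName("http://www.example.com/Vocabulary/MyStuff/2006/1/0",
                "AddressType")
loose = TypeAnnotation(ADDRESS, "mystuff-loose.xsd")
strict = TypeAnnotation(ADDRESS, "mystuff-strict.xsd")
```

Under this sketch both annotations match a test for AddressType even
though the definitions behind them differ, which is exactly the
trade-off (N1) accepts.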

In the latter case, we'll get a type error if the value from the
infoset isn't annotated with the right simple type.  As such, the
user of the pipeline can:

   * guarantee the correct schema is used by the previous step
     that produced the input on which the expression is to be applied.
   * use simple type constructors in the XPath 2.0 expression
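
The second option can be illustrated outside of XPath.  The Python
below is only an analogy, assuming the values arrive as untyped
strings: it applies an explicit constructor to both operands before
comparing, much as one would wrap both sides in a simple type
constructor in an XPath 2.0 expression.  The compare_as helper is a
hypothetical name.

```python
from decimal import Decimal
from typing import Callable


def compare_as(constructor: Callable[[str], object],
               left: str, right: str) -> bool:
    """Construct both operands as the same simple type, then compare."""
    return constructor(left) == constructor(right)


# Untyped lexical values, as they might arrive from a previous step
# that did not annotate them with the expected simple type.
quantity_a, quantity_b = "010", "10"
price_a, price_b = "1.50", "1.5"

# Compared as strings the values differ; compared after construction
# as integers or decimals they are equal.
same_as_strings = quantity_a == quantity_b
same_as_integers = compare_as(int, quantity_a, quantity_b)
same_as_decimals = compare_as(Decimal, price_a, price_b)
```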

I think if we accept both (N1) and (N2), we have a reasonable story
around typing that says that we stop at type names.  If you need to
guarantee that everything uses the same type definitions, you do that
by, well, using the same type definitions in every step that uses
schemata.

What we gain is that we can have pipelines that have different
definitions for the same target namespaces.  That can be very
useful when you know you want to loosen or tighten constraints
before or after different steps in the pipeline.  In the end, you
can use type selection if you make sure that you aren't mis-typing
simple-typed values that you need to compare.  I don't see that as
onerous for the pipeline author.

The last remaining issue is around the sets of schemata that you
might want to use within a pipeline.  If you want to validate
at different steps within the pipeline with different sets of
schemata for the same namespaces, you need some way to control
the namespace name to resource mapping.

Here I think we have two choices:

   1. Revive the concept of a resource manager.
   2. Create a specialized construct for namespace name to
      resource mappings.

I think (2) could be useful for other components that aren't
schemata (e.g. business rules/constraint languages).

When I added the validate step in smallx, I had a directed syntax
for the step and so I just added the mapping for (2) into the step
syntax (see [2]).  I think we need a more general approach than
this.

One idea is to just have a definition of namespace name to resource
mapping available to any component.  It could be considered a static
document resource that is just another input to the component
that resides within the pipeline.

For example (ignoring the choice of namespace):

<resource-map id='schema-set-1'>
<map uri='http://www.example.com/Vocabulary/MyStuff/2006/1/0'
      href='mystuff.xsd'/>
<map uri='http://www.w3.org/1999/xhtml'
      href='xhtml.xsd'/>
<map href='default.xsd'/>
</resource-map>

where the last element maps the no-namespace uri to some resource.
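
To show how little a component would need in order to consume such a
map, here is a minimal Python sketch that reads the example above into
a dictionary.  The load_resource_map function is a hypothetical helper,
and using None as the key for the no-namespace entry is just one
possible convention.

```python
import xml.etree.ElementTree as ET

RESOURCE_MAP = """<resource-map id='schema-set-1'>
<map uri='http://www.example.com/Vocabulary/MyStuff/2006/1/0'
      href='mystuff.xsd'/>
<map uri='http://www.w3.org/1999/xhtml'
      href='xhtml.xsd'/>
<map href='default.xsd'/>
</resource-map>"""


def load_resource_map(xml_text):
    """Return (id, {namespace name or None: resource href})."""
    root = ET.fromstring(xml_text)
    mapping = {}
    for entry in root.findall("map"):
        # A <map> without a uri attribute yields None, which this
        # sketch uses for the no-namespace mapping.
        mapping[entry.get("uri")] = entry.get("href")
    return root.get("id"), mapping


map_id, mapping = load_resource_map(RESOURCE_MAP)
```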

While we could use OASIS XML Catalogs [3], those catalogs don't handle
mapping the no namespace name to a resource.  If they could fix that,
I'd be happy to use that XML over the above.

This definition would be embedded in the pipeline as some
resource/document that a step could reference.  It could even be an
external document to the pipeline.

Solving (P1) and (P2) amounts to telling the "schema validate" step
which of these resource maps to use.
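
A sketch of that selection, assuming the pipeline holds several
resource maps keyed by id and a validate step names one of them.  The
map ids, the .xsd file names and the resolve_schema function are all
hypothetical.

```python
from typing import Dict, Optional

# Two overlapping sets for the same namespace name (P1), keyed by the
# id a step would reference (P2).  None is the no-namespace entry.
RESOURCE_MAPS: Dict[str, Dict[Optional[str], str]] = {
    "loose-set": {
        "http://www.w3.org/1999/xhtml": "xhtml-loose.xsd",
        None: "default.xsd",
    },
    "strict-set": {
        "http://www.w3.org/1999/xhtml": "xhtml-strict.xsd",
    },
}


def resolve_schema(map_id: str, namespace: Optional[str]) -> Optional[str]:
    """Look up the resource for a namespace name in the chosen map.

    Returns None when the chosen map has no entry for that namespace.
    """
    return RESOURCE_MAPS[map_id].get(namespace)


XHTML = "http://www.w3.org/1999/xhtml"
early_step_schema = resolve_schema("loose-set", XHTML)
late_step_schema = resolve_schema("strict-set", XHTML)
```

The same namespace name resolves to different schemata depending on
which map the step points at, so constraints can be loosened or
tightened at different points in the pipeline.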

[1] http://www.w3.org/TR/xpath20/
[2] https://smallx.dev.java.net/pipeline-spec.html#section-d0e890
[3] http://www.oasis-open.org/committees/entity/spec-2001-08-06.html

--Alex Milowski

Received on Sunday, 14 May 2006 21:47:32 UTC