RE: [XML Schema 1.1] Using doc() in xs:assert ... the referenced document needs a schema? from rjelliffe@allette.com.au on 2009-05-03 (xmlschema-dev@w3.org from May 2009)

From: <rjelliffe@allette.com.au>
Date: Sun, 3 May 2009 15:59:11 +1000 (EST)
To: xmlschema-dev@w3.org
Message-ID: <3200.203.33.167.144.1241330351.squirrel@intranet.allette.com.au>
>> > I think the working group felt that
>> > introducing context-dependent validation (where the validity of a
>> > document depends on factors other than the schema and the instance
>> > document) was a risky architectural innovation, and possibly a step
>> > that would be later regretted.
>>
>> So they actually had no reason? Just some vague possibility.
>
> Actually, now I recall some of the discussion: one of the concerns was
> specifically the subject of this original question. Should the referenced
> document be schema-validated? If so, how do you prevent circularities, or
> infinite regress? Some members of the WG feel strongly that validation
> should never pose any denial-of-service risks, and allowing doc() opens up
> all sorts of possibilities.

"All sorts of possibilities"?  Why are any more possibilities for DOS
opened up than currently exist with
import/include/redefine/xsi:schemaLocation/xsi:noNamespaceSchemaLocation?

Security must be a layer above schemas. Security policy decides which
features are safe or unsafe in a particular deployment, but it needn't
decide which features are left in or out of the spec. DTDs have entities,
and the billion laughs attack is handled by a security warning, not by
removing entities.

In practical terms, retrieval, validation and use of external files should
be a command-line or invocation option, not something that needs to stymie
the XSD WG: for example, make the default that only relative or local URLs
can be used, with lax validation.
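
A minimal sketch, in Python, of what such an invocation-level policy might
look like. The names RetrievalPolicy and may_retrieve are hypothetical and
do not belong to any XSD processor; "local" is assumed here to mean a
relative reference or a file: URL on the same machine.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class RetrievalPolicy:
    # Hypothetical invocation options; the names are illustrative only.
    allow_remote: bool = False   # default: relative or local URLs only
    validation: str = "lax"      # default: lax validation of retrieved documents


def may_retrieve(uri: str, policy: RetrievalPolicy = RetrievalPolicy()) -> bool:
    """Return True if the deployment's policy permits dereferencing `uri`."""
    if policy.allow_remote:
        return True
    parts = urlparse(uri)
    # Relative reference: no scheme and no authority (this rejects "//host/path").
    if parts.scheme == "" and parts.netloc == "":
        return True
    # Local file URL: file: scheme with an empty or localhost authority.
    if parts.scheme == "file" and parts.netloc in ("", "localhost"):
        return True
    return False
```

Under these assumptions the decision about remote retrieval is made by
whoever invokes the validator, for instance by passing
RetrievalPolicy(allow_remote=True), and not by the schema language.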

Trying to find the mythological optimum solution in the absence of
information will of course lead to the conclusion that no decision can be
made. But trying to find the optimal solution is itself an irrational
approach in that situation; the minimum to declare victory is better.

> Potential problems like this can consume an immense amount of WG time, and
> when a spec is running years late already, there is a strong temptation for
> the chair to encourage people to cut a feature rather than spend time
> discussing whether or not it creates a problem.

But the bottom line is that when a committee succumbs to arguments like
"there may be ramifications we have not identified", then it opens itself
to a selection of features based merely on the whims of one implementer or
another. It is a get-out-of-jail-free card: while it looks like a rational
argument, it cannot actually be argued against with any facts or specific
arguments. That is because it is a flawed kind of logic.
It gives the appearance of being prudent, but risk assessment is based on
looking at real problems and issues, not asserting risks in the absence of
any evidence or issues, if you see what I mean.

Is it really that XSD is such a rathole of interacting effects that issues
cannot be thought through, or is it that there is a (legitimate, I would
say) difference between the database/data-binding stakeholders who only
want schemas to drive their storage/CRUD requirements, and
web/messaging/publishing/QA/QC stakeholders who need the schema
constraints to reflect their web-based information arrangements?

In either case, the solution has to be to take a big axe to XSD:

  * XSD-lite with no type derivation syntax. Only built-in simple types.
Allow xs:choice inside a simple type as an alternative syntax for
enumerations. No list or union. No facets. No extension. Some
alternative syntax for complexTypes with simpleContent so that
complexContent and simpleContent can be left out. UPA as a caution message,
not a fatal error. Fewer restrictions (progress has been made on this,
I see). PSVI as optional. In fact, a reconstruction of RELAX NG in XSD
syntax, and a prelude to giving up type derivation of complex types as
a bad joke.

  * XSD-fat as a layer on top with type derivation, strict UPA,
assertions, etc., reconstructing the existing syntax.

XSD after 10 years is facing exactly the same situation that SGML faced
after 8 years in 1996: stakeholders saw that the planned revisions went
in the direction of more complexity rather than simplification (where
other layers would reconstruct the culled functionality).

Back 9 years ago, when I was on the WG, there were comments from WG
members against modularization or optionality along the lines that
validity should always be validity. If this were the real test, then XSD
has proven itself an utter failure: witness the XSD profiles from
different groups, such as the data-binding profiles at W3C. I trust that
on the XSD WG nowadays if anyone talks of reliably-widespread
implementation of any components, they get laughed back into reality: the
stable door is open, the horse has bolted--it would be more prudent to
close the door before putting in more horses.

Of course, then comes the predictable excuse that "when we look at it,
everyone has different requirements" as if doing nothing was an
improvement on at least meeting some stakeholder chunks.

People think the XSD Recommendation is horrible, and based on barmy editorial
principles. But aside from any eccentricities of the specs (and the
Recommendations have many virtues too), ultimately the problems with the
Recommendations are caused by the complexity of the underlying technology.

If the XSD WG wanted a rule-of-thumb for how big a layer should be, I
would suggest this: no technology should be so big that an experienced and
excellent programmer would take more than a month full-time to implement
the layer.

I think XML Schemas Datatypes meets this goal. RELAX NG and Schematron and
all the parts of DSDL do. But clearly XML Schemas Structures does not: I
am not sure that someone could even get competent in the Recommendation in
one month.

The XSD 1.1 revision is a great step forwards, but this is not much use
when you are falling down a hole. Of course I think assertions and so on
are really useful. But they are being tacked onto a Heath Robinson/Rube
Goldberg machine, and it doesn't have to be that way.

Cheers
Rick Jelliffe
Received on Sunday, 3 May 2009 06:00:10 UTC