LC-172 dropping the ANY wildcard (response to your comment)

Dear Murray:

The W3C XML Schema Working Group has spent the last several months
working through the comments received from the public on the last-call
draft of the XML Schema specification.  We thank you for the comments
you made on our specification during our last-call comment period, and
want to make sure you know that all comments received during the
last-call comment period have been recorded in our last-call issues
list (http://www.w3.org/2000/05/12-xmlschema-lcissues).

Among other issues, you raised the point registered as issue LC-172,
which suggests (at least implicitly) that we drop the ANY wildcards
for child elements and attributes, as being too complex for a version
1.0 language.

In the course of dealing with the last-call comments, the WG
considered this issue and asked me to convey to you their thanks for
the suggestion and their reasons for not accepting it.

The case for eliminating wildcards lies in their complexity, and the
simplification to the spec which would result from eliminating them.

The case for retaining them is that without wildcards of some kind,
XML Schemas are incapable of defining the kinds of languages many
schema authors would like to be able to define, or modeling the kinds
of extensibility exhibited by, say, existing HTML browsers.

The ANY keyword of SGML and XML 1.0 does not make it possible to
declare elements which accept arbitrary blocks of well-formed XML as
their content; it is thus impossible, in XML 1.0, to define a DTD for (say)
a protocol-oriented envelope, which carries arbitrary XML as its
payload. Our wildcards do that, and thus make it possible to have
schemas with 'black-box' areas. This is one of the things Schema does
which is clearly more expressive than DTDs, and it is essential to
allow the 'X' of XML to be meaningful not only in cases where
validation is forgone, but in cases where document types are formally
defined and validated.

The use of wildcards also makes it much easier to apply different
schemas to the same document instance: a schema for (say) tables can
say that within a cell, any non-table element (i.e. any element in a
namespace other than the table namespace) can occur; this is in effect
what the SGML Open table model is intended to allow, with the
difference that the SGML Open DTD fragment is compelled to say this in
a comment, and provide a parameter-entity hook which a hostile user
can easily misuse and defeat.  The XML Schema formulation allows
the same basic idea to be expressed in the schema proper, which is
(I submit to you) where it belongs.
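
To make the "any non-table element" rule concrete, here is a minimal
sketch in Python of the namespace test such a wildcard implies.  It is
an illustration only, not text from the spec: the namespace URIs and
the function names are hypothetical, and the decision to reject
elements with no namespace at all is an assumption of the sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI for the table module.
TABLE_NS = "http://example.org/table"


def namespace_of(elem):
    """Return the namespace URI of an ElementTree element, or None."""
    tag = elem.tag
    if tag.startswith("{"):
        # ElementTree writes qualified names as "{uri}local".
        return tag[1:].split("}", 1)[0]
    return None


def allowed_in_cell(elem):
    """True if elem is in some namespace other than the table namespace."""
    ns = namespace_of(elem)
    # Assumption of this sketch: unqualified elements are rejected.
    return ns is not None and ns != TABLE_NS


cell = ET.fromstring(
    '<t:cell xmlns:t="http://example.org/table"'
    ' xmlns:m="http://example.org/math">'
    "<m:formula/><t:table/><note/></t:cell>"
)
results = [allowed_in_cell(child) for child in cell]
```

Here only the m:formula child passes the test; the nested t:table
and the unqualified note element do not.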

You ask about the interaction of the ANY wildcards with the various
levels of schema validation (strict, skip, and lax).  Let me try to
summarize the situation, imagining for concreteness that we are
talking about a 'cell' element in a table module whose content is zero
or more ANY-other-namespace wildcards.

   - with STRICT validation, the ANY wildcard effectively allows
any element outside the table namespace for which a declaration is
provided.  If the cell element has mixed content, this is almost
the same as an SGML/XML ANY keyword.  (Since we have excluded the
elements of the table namespace, it's not quite the same; if we
did not exclude them, it would be exactly the same.)
   - with SKIP validation, the ANY wildcard effectively says each
child element within the table cell is a black box of well-formed
XML, which is not to be looked at for validation purposes.  This
is not the way I would define a table module, myself, but if we
know that tables never nest, then this behavior resembles that of
some table processors I have heard about.  SKIP validation with
ANY wildcards is exactly suited, however, to the definition of
document envelopes which can take anything as their payload, including
another envelope.  The processor for the top-level envelope should
pay attention only to the top-level envelope, not to any nested
envelopes; SKIP validation expresses this approach from the point
of view of validation.  (Why skip validation of the envelope
contents?  Perhaps I want to send you, in an envelope, an envelope
which I know to be invalid, in order to ask you "Why is this
envelope not valid?")
   - with LAX validation, the ANY wildcard effectively says "any
element can go here -- but if the schema includes a declaration for
that element, it should be validated in the normal way."  This is
what some people call 'opportunistic' validation; it might be used
(for example) to check the structure of all the tables, including
nested tables, in a document, and nothing else:  construct a schema
just with the table declarations, and validate.
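
The three behaviors just described can be sketched in a few lines of
Python.  This is a rough model under stated assumptions, not the
algorithm of the spec: the function name, the 'declarations' mapping
from element names to checker functions, and the toy rule that a
table needs a row are all hypothetical, and namespaces are ignored
for brevity.

```python
import xml.etree.ElementTree as ET


def validate_wildcard_child(elem, declarations, mode):
    """Return a list of error strings for an element matched by a wildcard.

    declarations maps an element name to a checker function which
    takes the element and returns a list of error strings.
    """
    if mode == "skip":
        # Black box: well-formed XML, not examined any further.
        return []
    checker = declarations.get(elem.tag)
    if checker is not None:
        # A declaration is available: validate in the normal way.
        return checker(elem)
    if mode == "strict":
        return [f"no declaration for <{elem.tag}>"]
    # lax: the undeclared element is accepted, but its children are
    # still validated wherever declarations are available.
    errors = []
    for child in elem:
        errors += validate_wildcard_child(child, declarations, "lax")
    return errors


def check_table(elem):
    """Toy rule (hypothetical): a table must contain at least one row."""
    return [] if elem.find("row") is not None else ["<table> has no <row>"]


declarations = {"table": check_table}
doc = ET.fromstring("<para><table/></para>")

strict_errors = validate_wildcard_child(doc, declarations, "strict")
skip_errors = validate_wildcard_child(doc, declarations, "skip")
lax_errors = validate_wildcard_child(doc, declarations, "lax")
```

With only the table declared, STRICT rejects the undeclared para
element, SKIP reports nothing, and LAX accepts para but still reports
the rowless table inside it.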

Existing systems take three approaches to extensibility of markup
languages, which correspond to three ways of handling unknown elements.

  (1) They can say "an unknown element is an error" -- that is, in
      effect, what STRICT validation specifies.

  (2) They can say "the start- and end-tags of an unknown element are
      skipped, and the content is processed in the normal way" -- that
      is, in effect, what LAX validation does.  As the experience of
      the Web has shown, this approach to extensibility is just what is
      needed to allow peaceful coexistence of old software with certain
      kinds of extensions to the markup language.  Effectively, the
      rule to ignore tags for unknown element types amounts to saying
      "treat undeclared elements as if declared with mixed content and
      zero or more ANY wildcards, and perform lax validation,
      i.e. validate any children for which you have declarations".  As
      you know, this works with some but not all kinds of extensions to
      a markup language: sometimes what you need to say is "undeclared
      elements should be skipped in their entirety."

  (3) They can say "unknown elements are to be skipped" regardless
      of their contents -- that is what SKIP validation does.

I have not discussed the ANY-ATTRIBUTE wildcard, but I believe you
will see how it can be used.

Since the desire to be able to declare a 'well-formedness slot' in a
markup language is one of the most common requests for improvements on
the capabilities of DTDs, I think the WG was right to design the ANY
wildcards into the language, and I hope the discussion above helps
persuade you that the WG did the right thing in retaining them despite
your invitation to remove them.

It would be helpful to us to know whether you are satisfied with the
decision taken by the WG on this issue, or wish your dissent from the
WG's decision to be recorded for consideration by the Director of
the W3C.

best,

Michael

Received on Thursday, 5 October 2000 15:25:37 UTC