On subsetting XML...

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Several weeks ago, I took an action item to draft some of my
thoughts about what an XML subset should look like. The TAG has
discussed these ideas a couple of times and, while it's probably an
overstatement to say that we have consensus, we did agree that it was
time to distribute these ideas more broadly and solicit more input.

Comments, etc., most welcome, as always.

Profiling XML, that is, providing more implementation options, will
necessarily increase the possibility of interoperability problems, and
it would be best to avoid doing so. Profiles are a bad idea on general
principles and are in direct conflict with one of the original goals
of XML[1]: "the number of optional features in XML is to be kept to
the absolute minimum, ideally zero."

Unfortunately, a number of user communities have expressed a need to
work with only a subset of XML. The TAG is concerned that if these
needs are not addressed quickly (and centrally), a number of slightly
different XML subsets will arise and, if this trend continues, the
stability of XML as the basis of a whole range of technologies could
be jeopardized.

One way to avoid this problem is to produce a new recommendation that
identifies a subset of XML for use in those environments where
supporting all of XML is not practical.

One obvious place where such a subset has been deployed is in SOAP[2].
SOAP forbids internal and external subsets and strongly discourages
processing instructions.

When asked, the XML Protocol WG listed these[3] among their reasons for
subsetting:

 * Performance: processing internal subsets and buffer management for
                handling entity expansion would slow things down.
 * Simplicity:  if an external subset is referenced, it has to be
                available when the parser runs (if it's available
                to some but not all processors, different results
                are possible).
 * Security:    entity expansion introduces the possibility of DoS
                attacks; other security issues might arise.
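
To put a rough number on the security concern above, here is a minimal
sketch in Python. It builds a document whose internal subset defines
nested entities and computes how large the text would become if a
parser expanded them all. The function names, the entity names (e0,
e1, ...), and the levels/fanout parameters are illustrative
assumptions of mine; none of them comes from the XML Protocol WG's
message.

```python
def build_nested_entity_doc(levels: int, fanout: int) -> str:
    """Return an XML document whose internal subset nests entities.

    Entity e0 is the single character "x"; each later entity e<i>
    consists of `fanout` references to e<i-1>.
    """
    decls = ['<!ENTITY e0 "x">']
    for i in range(1, levels + 1):
        refs = f"&e{i - 1};" * fanout
        decls.append(f'<!ENTITY e{i} "{refs}">')
    subset = "\n  ".join(decls)
    return f"<!DOCTYPE doc [\n  {subset}\n]>\n<doc>&e{levels};</doc>"


def expanded_length(levels: int, fanout: int) -> int:
    """Number of characters in <doc> after full entity expansion."""
    # e0 is 1 character and every level multiplies the size by fanout.
    return fanout ** levels


doc = build_nested_entity_doc(levels=9, fanout=10)
print("characters on the wire:  ", len(doc))
print("characters when expanded:", expanded_length(9, 10))
```

With nine levels and a fanout of ten, a document of well under a
thousand characters stands for 10**9 characters of expanded text. A
subset with no entity declarations cannot express this at all.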

Although it was explicitly not a goal of the XML Protocol WG to
produce a subset of XML (independent of their own application needs),
this seems like a good place to start.

However, precisely how the subset is defined requires careful
consideration as this is an exercise that should be conducted only
once. The subset selected must be small enough so that no further
subset will be required but also complete enough to be useful for a
wide range of applications.

One clear requirement of the subset is that it must exclude internal
and external subsets (no <!DOCTYPE declaration is allowed). This
requirement effectively removes DTDs from XML and consequently removes
entities and notations.
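
As an illustration only, a processor could enforce this requirement by
refusing any document in which the parser reports a document type
declaration. The sketch below uses Python's standard expat binding;
the check_no_doctype function and the DoctypeNotAllowed exception are
hypothetical names chosen for this example, not part of any
specification.

```python
import xml.parsers.expat


class DoctypeNotAllowed(ValueError):
    """Raised when a document carries a <!DOCTYPE declaration."""


def check_no_doctype(document: bytes) -> None:
    """Parse `document` and raise if it has a document type declaration.

    Documents that are not well-formed raise
    xml.parsers.expat.ExpatError, as with any expat parse.
    """
    parser = xml.parsers.expat.ParserCreate()

    def reject_doctype(name, system_id, public_id, has_internal_subset):
        # Called by expat as soon as it sees the start of <!DOCTYPE ...>,
        # before any entity declared in the subset could be used.
        raise DoctypeNotAllowed(f"<!DOCTYPE {name} ...> is outside the subset")

    parser.StartDoctypeDeclHandler = reject_doctype
    parser.Parse(document, True)  # True marks this as the final chunk


# Elements, attributes, comments and processing instructions pass.
check_no_doctype(b"<doc a='1'><!-- note --><?target data?>text</doc>")

# Any document type declaration is rejected.
try:
    check_no_doctype(b'<!DOCTYPE doc [<!ENTITY e "x">]><doc>&e;</doc>')
except DoctypeNotAllowed as err:
    print("rejected:", err)
```

Rejecting at the start of the declaration, rather than after parsing
it, means the processor never has to read an internal subset or fetch
an external one.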

What remains are elements, attributes, namespace declarations,
comments, processing instructions, and character data. While comments
and processing instructions might conceivably be removed, they are
sufficiently useful that we think they should remain. (Although the
SOAP spec forbids senders from including processing instructions, it
accepts that receivers might get them, so it's clear that removing
processing instructions from the subset is not a requirement of the
SOAP subset.)

Some people have proposed that what is really needed is a "subset
plus", that is a subset of XML with a new feature or two. The most
often requested feature in this regard is support for xml:id. I
feel strongly that it would be a mistake to introduce a single
new feature, or a single change of any sort that would not be
completely compatible with XML 1.1, in the work that subsets XML.
(Support for xml:id or any other feature is an orthogonal issue and
must not be conflated with the effort to define a subset, even if the
subset makes a particular feature more necessary or desirable.)

Along these lines, a number of people have suggested that the right
approach to this problem is to define a new recommendation that
combines the current suite of related recommendations (XML 1.1, XML
Infoset, Namespaces in XML, and perhaps XML Base) into a single
document. Tim Bray has demonstrated[4] one example of how this might
appear.

To the extent that this might be viewed as an editorial decision, one
that may offer tangible benefits to XML users and implementors,
particularly new users and implementors, but which makes no technical
changes to the languages defined by (and definable by) XML, this seems
not unreasonable.

However, it's clear that performing this "unification" exercise on all
of XML 1.1, without introducing any backwards-incompatible changes,
may be an extraordinarily large editorial task, out of proportion with
the effort required simply to define the subset in some less invasive
way.

Conversely, defining only the subset in a unified document would be
easier but would introduce issues of its own. Doing so would
effectively split the XML specifications into two tracks: a unified
"subset" track and a non-unified "full XML" track. This could be the
source of considerable confusion and accidental divergence.

Further consideration of these technical and editorial issues, and the
eventual creation of a new recommendation, would seem to be within the
scope of the XML Core WG's charter[5], which reads, in part, "[the] WG
will also study the advisability of a version 2.0 of the XML
specification and may undertake the preparation of such a
specification, if deemed advisable."

In short, it appears that a new recommendation-track document that
defines a subset of XML 1.1 should be developed:

  * The subset must be backwards compatible with XML 1.1.
  * The subset must define a language that excludes DTD declarations.

How the new recommendation is constructed we leave to the editorial
discretion of the group that undertakes it.

[1] http://www.w3.org/TR/REC-xml#sec-origin-goals
[2] http://www.w3.org/2000/xp/Group/2/11/08/soap12-part1.html#soapenv
[3] http://lists.w3.org/Archives/Public/www-tag/2002Dec/0119.html
[4] http://www.textuality.com/xml/xmlSW.html
[5] http://www.w3.org/2001/12/xmlbp/xml-core-wg-charter.html#deliverables

                                        Be seeing you,
                                          norm

- -- 
Norman.Walsh@Sun.COM    | There is a road from the eye to the heart
XML Standards Architect | that does not go through the intellect.--G.
Web Tech. and Standards | K. Chesterton
Sun Microsystems, Inc.  | 
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: Processed by Mailcrypt 3.5.7 <http://mailcrypt.sourceforge.net/>

iD8DBQE+JeMOOyltUcwYWjsRAmsuAJ49pGFH6nPSmZvEXQNrVZGr37plygCeOUJD
zHXfJSzY/7YujRkXD5o+yzo=
=3UHP
-----END PGP SIGNATURE-----

Received on Wednesday, 15 January 2003 17:40:51 UTC