[xmlProfiles-29] TAG recommendation for work on subset of XML 1.1

Liam,

This email concerns TAG issue xmlProfiles-29 [0]:

  "When, whither and how to profile W3C specifications
   in the XML Family"

Profiling XML, that is, providing more implementation options,
will necessarily increase the possibility of interoperability
problems, and it would be best to avoid doing so. Profiles are a bad idea
on general principles and are in direct conflict with one of the
original goals of XML[1]: "the number of optional features in XML
is to be kept to the absolute minimum, ideally zero."

Unfortunately, a number of user communities have expressed a need
to work with only a subset of XML. The TAG is concerned that if
these needs are not addressed quickly (and centrally), a number
of slightly different XML subsets will arise and, if this trend
continues, the stability of XML as the basis of a whole range of
technologies could be jeopardized.

One way to avoid this problem is to produce a new Recommendation
that identifies a subset of XML for use in those environments
where supporting all of XML is not practical.

One obvious place where such a subset has been deployed is in
SOAP[2].  SOAP forbids internal and external subsets and strongly
discourages processing instructions.

When asked, the XML Protocol WG listed these[3] among their
reasons for subsetting:

  * Performance: processing internal subsets and buffer management
                 for handling entity expansion would slow things down.

  * Simplicity:  if an external subset is referenced, it has to be
                 available when the parser runs (if it's available
                 to some but not all processors, different results
                 are possible).

  * Security:    entity expansion introduces the possibility of
                 denial of service (DoS) attacks; other security
                 issues might arise.
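
As a rough illustration of the security concern, the following
Python sketch works out how large a nested set of internal
entities becomes when fully expanded. The function names, the
entity names (e0, e1, ...) and the "lol" base text are
hypothetical and are not taken from the XML Protocol WG's
message; nothing here is parsed, only sizes are computed.

```python
def expanded_length(levels: int, refs_per_level: int, base_text: str = "lol") -> int:
    """Length of the top-level entity after full expansion.

    Entity e0 holds base_text; each entity e(i) consists of
    refs_per_level references to e(i-1), so the expanded size is
    multiplied by refs_per_level at every level.
    """
    return len(base_text) * refs_per_level ** levels


def nested_entity_doctype(levels: int, refs_per_level: int, base_text: str = "lol") -> str:
    """Build the (small) document type declaration that defines the entities."""
    lines = [f'<!ENTITY e0 "{base_text}">']
    for i in range(1, levels + 1):
        refs = f"&e{i - 1};" * refs_per_level
        lines.append(f'<!ENTITY e{i} "{refs}">')
    return "<!DOCTYPE r [\n" + "\n".join(lines) + "\n]>"


if __name__ == "__main__":
    source_size = len(nested_entity_doctype(9, 10))
    print("declaration size in characters:", source_size)
    print("expanded size in characters:", expanded_length(9, 10))
```

With nine levels of ten references each, a declaration of well
under a thousand characters would expand to three billion
characters, which is the kind of amplification a processor
would have to guard against.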

Although it was explicitly not a goal of the XML Protocol WG to
produce a subset of XML (independent of their own application
needs), this seems like a good place to start.

However, precisely how the subset is defined requires careful
consideration as this is an exercise that should be conducted
only once. The subset selected must be small enough so that no
further subset will be required but also complete enough to be
useful for a wide range of applications.

One clear requirement of the subset is that it must exclude
internal and external subsets (no <!DOCTYPE declaration is
allowed). This requirement effectively removes DTDs from XML and
consequently removes entities and notations.
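
A minimal sketch of how a processor might enforce this
requirement, assuming Python's standard xml.parsers.expat
module: the parser is stopped as soon as a <!DOCTYPE declaration
is reported. The names check_subset and DoctypeNotAllowed are
hypothetical, and this is one possible check, not a definition
of the subset.

```python
import xml.parsers.expat


class DoctypeNotAllowed(ValueError):
    """Raised when a document contains a <!DOCTYPE declaration."""


def check_subset(document: bytes) -> bool:
    """Return True if the document is well-formed and has no DOCTYPE.

    Raises DoctypeNotAllowed when a document type declaration is
    seen, and xml.parsers.expat.ExpatError when the document is
    not well-formed.
    """
    parser = xml.parsers.expat.ParserCreate()

    def reject_doctype(name, system_id, public_id, has_internal_subset):
        # Called by expat at the start of any <!DOCTYPE declaration,
        # with or without an internal subset.
        raise DoctypeNotAllowed(f"DOCTYPE declaration found for {name!r}")

    parser.StartDoctypeDeclHandler = reject_doctype
    parser.Parse(document, True)  # True marks this as the final chunk
    return True
```

Because the check fires at the start of the declaration, the
entity and notation declarations inside it are never processed.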

What remains are elements, attributes, namespace declarations,
comments, processing instructions, and character data. While
comments and processing instructions might conceivably be
removed, they are sufficiently useful that we think they should
remain.

Some people have proposed that what is really needed is a "subset
plus," that is, a subset of XML with a new feature or two. The
most often requested feature in this regard is support for
xml:id. Others feel that it would be a mistake to design a
"subset plus" with any new feature incompatible with XML 1.1 (at
this time a W3C Candidate Recommendation). The TAG has not yet
reached consensus on how an XML subset Recommendation should
address the question of ids.  The TAG expects to address this
issue separately:

   Issue xmlIDSemantics-32: How should the problem of identifying
   ID semantics in XML languages be addressed in the absence of a
   DTD? [6]

A number of people have suggested that the right approach to this
problem is to define a new Recommendation that combines the
current suite of related recommendations (XML 1.1, XML Infoset,
Namespaces in XML, and perhaps XML Base) into a single
document. Tim Bray has demonstrated[4] one example of how this
might appear.

To the extent that this might be viewed as an editorial decision,
one that may offer tangible benefits to XML users and
implementors, particularly new users and implementors, but which
makes no technical changes to the languages defined by (and
definable by) XML, this seems not unreasonable.

However, it's clear that performing this "unification" exercise
on all of XML 1.1, without introducing any backwards incompatible
changes, may be an extraordinarily large editorial task, out of
proportion with the effort required simply to define the subset
in some less invasive way.

Conversely, defining only the subset in a unified document would
be easier but would introduce issues of its own. Doing so would
effectively split the XML specifications into two tracks: a
unified "subset" track and a non-unified "full XML" track. This
could be the source of considerable confusion and accidental
divergence.

Further consideration of these technical and editorial issues,
and the eventual creation of a new Recommendation, would seem to
be within the scope of the XML Core WG's charter[5] which reads,
in part, "[the] WG will also study the advisability of a version
2.0 of the XML specification and may undertake the preparation of
such a specification, if deemed advisable."

In short, it appears that a new Recommendation-track document
that defines a subset of XML 1.1 should be developed:

   * The subset must be backwards compatible with XML 1.1.
   * The subset must define a language that excludes document
     type declarations (no <!DOCTYPE declaration is allowed).

How the new Recommendation is constructed we leave to the
editorial discretion of the group that undertakes it.

Thank you,

  - Ian Jacobs, for
    Norm Walsh, author of this summary, and
    Stuart Williams and Tim Berners-Lee, TAG co-Chairs

[0] http://www.w3.org/2001/tag/ilist#xmlProfiles-29
[1] http://www.w3.org/TR/REC-xml#sec-origin-goals
[2] http://www.w3.org/2000/xp/Group/2/11/08/soap12-part1#soapenv
[3] http://lists.w3.org/Archives/Public/www-tag/2002Dec/0119
[4] http://www.textuality.com/xml/xmlSW
[5] http://www.w3.org/2001/12/xmlbp/xml-core-wg-charter#deliverables
[6] http://www.w3.org/2001/tag/ilist#xmlIDSemantics-32

-- 
Ian Jacobs (ij@w3.org)   http://www.w3.org/People/Jacobs
Tel:                     +1 718 260-9447

Received on Thursday, 30 January 2003 16:45:58 UTC