Re: Draft of position on SOAP's use of XML Internal subset from noah_mendelsohn@us.ibm.com on 2002-12-06 (xml-dist-app@w3.org from December 2002)

From: <noah_mendelsohn@us.ibm.com>
Date: Fri, 6 Dec 2002 15:29:50 -0500
To: fallside@us.ibm.com
Cc: xml-dist-app@w3.org
Message-ID: <OF0A22FAAF.FE3EDD0C-ON85256C87.006614B4@lotus.com>
I promised David a final draft by COB today.  Here it is.  Unless there are
problems, I expect David will send this out officially on our behalf later
today.

==========Start of Draft============
The XML Protocols Workgroup appreciates this opportunity to clarify our
design decisions regarding use of XML features such as the Internal Subset
(for those not familiar with the term, "Internal Subset" is the official
term for a DTD that appears within an XML document).

Background
==========

Before discussing our (lack of) use of DTDs, it's helpful to briefly
clarify what SOAP is, as well as to review some use cases that influenced
our decision making.  The following are not necessarily official use cases
and requirements, but they are representative of the concerns that
many implementers considered important:

Informally, SOAP is a specification that describes certain aspects of the
creation, transmission, and processing of messages.  SOAP messages
originate at a node called the "initial sender", flow along a message path
through zero or more "intermediary" nodes, eventually reaching (in the
absence of errors) an "ultimate receiver".  SOAP sets out the rules for
initial construction of a message, rules by which messages are processed
when received at an intermediary or ultimate destination, and rules by
which portions of the message can be inserted, deleted, or modified by the
actions of an intermediary.  Thus, SOAP deals not just with the messages
transiting a given hop, but with the manipulation of those messages as they
go through successive intermediaries.

SOAP is a framework that's intended to be useable for a broad range of
applications, on a variety of devices, and in a broad range of performance
regimes.  Among the goals is for SOAP to be useable as a replacement for
certain high performance binary protocols such as EDI, at least in certain
applications.  Accordingly, the ability to run in a performance regime of
hundreds or thousands of messages per second per node is highly desirable.

SOAP is designed to be hostable on a variety of so-called underlying
protocols.  A binding to HTTP is provided and we expect that it will be
widely deployed, but the specification provides the mechanisms necessary
for users (or the W3C) to create bindings to other protocols, or to create
alternative bindings to HTTP.

Message Infosets and Bindings
=============================

SOAP messages are specified as XML Infosets -- see [1] (Note, references
are to a snapshot of the latest editors' copies of the SOAP specs,
reflecting some resolutions to last call issues.  I believe the version I
am referencing is the latest that is stable in W3C URI "date space".  It is
later than our last official WD.)  The initial sender prepares a SOAP
message in the form of what the Infoset Recommendation calls a "synthetic
infoset" [2].  In other words, the initial sender typically does not have a
document to parse to produce the infoset;  rather, the initial sender
establishes, using programming structures of its choosing (could be
something like DOM or SAX) the elements, attributes and other content of
the outgoing message.
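
As a rough illustration of a synthetic infoset, the sketch below builds an
envelope directly from program structures, with no document being parsed.
It uses Python's xml.etree.ElementTree.  The envelope namespace is the one
from the final SOAP 1.2 Recommendation (an assumption here, since the
drafts referenced in this note used an earlier URI), and the example.org
namespace and the trace and getOrder elements are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace of the final SOAP 1.2 Recommendation; the drafts
# referenced in this note used an earlier, dated namespace URI.
SOAP_ENV = "http://www.w3.org/2003/05/soap-envelope"
APP_NS = "http://example.org/app"  # hypothetical application namespace


def build_envelope(order_id: str) -> ET.Element:
    """Construct the message infoset in memory; nothing is parsed."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    header = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Header")
    trace = ET.SubElement(header, f"{{{APP_NS}}}trace")  # hypothetical header block
    trace.text = "on"
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    order = ET.SubElement(body, f"{{{APP_NS}}}getOrder")  # hypothetical body element
    order.set("id", order_id)
    return envelope


message = build_envelope("42")
```

Whether this tree is ever turned into angle brackets is then a matter for
the binding, not for the code that built it.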

The purpose of a binding, such as the HTTP binding, is to provide a means
for moving the message Infoset from one node to the next.  The way in which
the message is represented on the wire is completely at the discretion of
the binding, and is not otherwise visible in the architecture.  The HTTP
binding supplied with SOAP uses an XML 1.0 serialization of the Infoset. It
sends that serialization in an HTTP POST request or in an HTTP response,
typically with the media type application/soap+xml.

Note that, because SOAP is Infoset based, in a situation where two nodes
share a memory (run on the same processor or tightly coupled MP), it is
perfectly sensible to build a binding that does its work by just passing
around DOMs, SAX streams, or other in-memory representations of the
Infoset.  In these cases, no serialization or parsing need ever be done.
Also:  implementations can in principle use compressed or encrypted forms,
possibly by compressing or encrypting the angle-bracket (<...>)
serialization, but also possibly by using other compressed or encrypted
representations.   In
principle bindings could also be written to send parts of the Infoset out
of order, in parallel over multiple links to improve bandwidth on large
messages, etc.
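
To make the point about bindings concrete, here is a minimal sketch of two
bindings that move the same Infoset in very different ways.  Both class
names, the same_infoset helper, and the example.org namespace are
hypothetical; nothing here is taken from the SOAP specifications.  One
binding hands over the in-memory tree untouched, the other compresses an
XML 1.0 serialization with zlib, and in both cases the receiver ends up
with an equivalent tree.

```python
import xml.etree.ElementTree as ET
import zlib


class InMemoryBinding:
    """Hypothetical binding for nodes that share memory: no serialization."""

    def transfer(self, infoset: ET.Element) -> ET.Element:
        return infoset  # the receiver sees the very same tree


class CompressedXmlBinding:
    """Hypothetical binding that compresses the XML 1.0 serialization."""

    def send(self, infoset: ET.Element) -> bytes:
        return zlib.compress(ET.tostring(infoset, encoding="utf-8"))

    def receive(self, wire: bytes) -> ET.Element:
        return ET.fromstring(zlib.decompress(wire))

    def transfer(self, infoset: ET.Element) -> ET.Element:
        return self.receive(self.send(infoset))


def same_infoset(a: ET.Element, b: ET.Element) -> bool:
    """Compare element names, attributes, text and children recursively."""
    return (
        a.tag == b.tag
        and a.attrib == b.attrib
        and (a.text or "") == (b.text or "")
        and (a.tail or "") == (b.tail or "")
        and len(a) == len(b)
        and all(same_infoset(x, y) for x, y in zip(a, b))
    )


msg = ET.Element("{http://example.org/env}Envelope")  # hypothetical namespace
ET.SubElement(msg, "{http://example.org/env}Body").text = "hello"

for binding in (InMemoryBinding(), CompressedXmlBinding()):
    assert same_infoset(msg, binding.transfer(msg))
```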

Use of DTD Internal Subsets
===========================

Thus, we must consider several related issues:

Q. Do DTD internal subsets or other DTD-related features appear in a SOAP
message Infoset?
A. By definition, they do not.  See [1], which says:

      "A SOAP message is specified as an XML Infoset
      that consists of a document information item with
      exactly one member in its [children] property,
      which MUST be the SOAP Envelope element
      information item (see 5.1 SOAP Envelope). This
      element information item is also the value of the
      [document element] property. The [notations] and
      [unparsed entities] properties are both empty. The
      [base URI], [character encoding scheme] and
      [version] properties can have any legal value. The
      [standalone] property either has a value of "yes"
      or has no value.

      The XML infoset of a SOAP message MUST NOT contain
      a document type declaration information item."

So, to the extent the Infoset recommendation is capable of reflecting the
presence of DTDs, SOAP rules them out.  SOAP messages do not contain DTDs.
SOAP messages also must not reference external DTDs.

Q. Can DTDs or schema validation be used to supply defaults or otherwise
augment or alter the contents of a SOAP message?
A. No, not insofar as such augmentation would change the results of SOAP
processing.  SOAP makes clear that the values of all elements and
attributes pertinent to SOAP itself must be carried explicitly in each
message -- neither Schema nor DTD (nor any other) validation can be used to
establish defaults for SOAP's attributes, though in certain cases SOAP
directly defines what the behavior will be if optional attributes are left
out.  That said, applications can do whatever they want with data received
from SOAP bodies or header entries.  If an application chooses to infer
information from schema validation of information received in a SOAP
message, that is the business of the application.
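
As a small sketch of what "carried explicitly" means in practice, the
function below reads mustUnderstand from the header block itself and never
consults a schema or DTD.  The absent case is handled by the rule SOAP
itself defines (to our understanding of SOAP 1.2 Part 1, omitting the
attribute is equivalent to "false").  The namespace is that of the final
SOAP 1.2 Recommendation, which is an assumption relative to the drafts
cited here, and the header blocks a:h1 and a:h2 are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace of the final SOAP 1.2 Recommendation.
SOAP_ENV = "http://www.w3.org/2003/05/soap-envelope"


def must_understand(header_block: ET.Element) -> bool:
    """Read mustUnderstand from the message itself, never from a schema or DTD."""
    value = header_block.get(f"{{{SOAP_ENV}}}mustUnderstand")
    if value is None:
        # The absent case is defined by SOAP itself: equivalent to "false".
        return False
    return value.strip() in ("true", "1")  # xs:boolean lexical forms for true


header = ET.fromstring(
    f'<env:Header xmlns:env="{SOAP_ENV}" xmlns:a="http://example.org/a">'
    '<a:h1 env:mustUnderstand="true"/><a:h2/></env:Header>'
)
flags = [must_understand(block) for block in header]
```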

Q. Can a binding use DTDs in its "on the wire" format?
A. In principle, yes.  Somebody could write a binding that, for example,
declares entities in an internal subset, perhaps to represent commonly
appearing substrings, and could call for their expansion upon receipt.
Note, however, that such use of a DTD must be completely private to the
binding; upon receipt an Infoset must in all cases be reconstructed to be
identical to the one provided for transmission, and by definition that does
not contain a DTD (see above).
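
A minimal sketch of such a private use of entities follows.  The wire
format, the entity name co, and the element names are all hypothetical, and
namespaces are omitted for brevity.  The point is only that the parser on
the receiving side expands the entity, so the tree handed up to SOAP is the
same DTD-free tree the sender supplied.

```python
import xml.etree.ElementTree as ET

# Hypothetical wire format: the binding abbreviates a repeated string with an
# entity declared in an internal subset. This DTD never reaches the SOAP layer.
WIRE = (
    '<!DOCTYPE Envelope [<!ENTITY co "Example Widgets Corporation">]>'
    "<Envelope><Body><from>&co;</from><to>&co;</to></Body></Envelope>"
)

# What the sender handed to the binding: a DTD-free infoset.
ORIGINAL = (
    "<Envelope><Body><from>Example Widgets Corporation</from>"
    "<to>Example Widgets Corporation</to></Body></Envelope>"
)

received = ET.fromstring(WIRE)      # the parser expands &co; on receipt
expected = ET.fromstring(ORIGINAL)

# Serializations of the two trees match: the reconstructed content is the same.
assert ET.tostring(received) == ET.tostring(expected)
```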

Q. Does the HTTP binding provided with SOAP use DTDs as described above?
A. No.  The SOAP HTTP binding uses the obvious DTD-free serialization of
the SOAP message Infoset.

Q. If a DTD is present and the SOAP HTTP binding is used, what does a
receiving node do?
A. If an implementation of the SOAP HTTP binding receives a message that
contains a DTD, then it knows that it is talking to an erroneous
implementation at the sender.  It SHOULD send a so-called env:Sender fault.
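
One way a receiver might perform this check is sketched below, using the
doctype callback of Python's xml.parsers.expat.  The SenderFault exception
and the check_no_dtd function are hypothetical stand-ins for whatever an
implementation does to generate the fault; this is an illustration, not
something the HTTP binding prescribes.

```python
import xml.parsers.expat


class SenderFault(Exception):
    """Hypothetical stand-in for generating a SOAP fault with code env:Sender."""


def check_no_dtd(wire: bytes) -> None:
    """Raise SenderFault if the serialized message carries a DOCTYPE."""
    parser = xml.parsers.expat.ParserCreate()

    def on_doctype(name, system_id, public_id, has_internal_subset):
        raise SenderFault(f"DOCTYPE '{name}' is not allowed in a SOAP message")

    # Called by expat as soon as a document type declaration starts.
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(wire, True)


check_no_dtd(b"<Envelope><Body/></Envelope>")  # accepted silently

try:
    check_no_dtd(b'<!DOCTYPE Envelope [<!ENTITY x "y">]><Envelope/>')
    rejected = False
except SenderFault:
    rejected = True
```

Note that this depends on a parser that reports the DOCTYPE at all, which
is exactly the requirement discussed later in this note.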

Why did we make these decisions?
=================================

That's how SOAP works.  The question is, of course: why?  Primarily, the
reasons are (a) performance and (b) keeping it simple.  In the high
performance regimes where some SOAP implementations will operate, the
parsers will likely be tuned for SOAP message handling.  Doing general
entity substitution beyond that mandated by XML 1.0 (e.g. &lt;) implies a
degree of buffer management, often data copying, etc. which can be a
noticeable burden when going for truly high performance.  This performance
effect has been reported by workgroup members who are building high
performance SOAP implementations.

Furthermore, a DTD in the Infoset would become another piece of the
message.  We would have questions to answer:  what are the rules for
relaying through an intermediary?  If something comes into an intermediary
as an entity reference, must it go out as an entity reference?  If that
header is removed by the intermediary, must one check whether it is the
last use of the entity and should the outbound DTD have the definition
removed?  What does all this do to digital signatures?  If we allowed an
internal subset, should we change our rules to allow attributes to be
defaulted?   All of this is complication.  So, in addition to performance,
leaving out DTDs keeps things simpler, which by the way tends to avoid
other performance problems.

Security is another concern.  Although we have not formally demonstrated
that XML with internal subset is less secure, several members of the
workgroup shared an intuition that entity substitution, attribute
defaulting, and other manipulation of the message content were more likely
to lead to security exposures, denial of service attacks (e.g. the billion
laughs entity attack), etc.
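
For a sense of scale, the arithmetic behind the billion laughs attack can
be sketched without constructing the attack itself.  The expanded_size
helper is hypothetical; it assumes the usual shape of the attack, in which
each entity is defined as a fixed number of references to the entity one
level below it.

```python
def expanded_size(levels: int, fanout: int, leaf: str) -> int:
    """Bytes produced by fully expanding the top entity of a nested chain.

    Level 0 is the literal leaf string; each higher level consists of
    `fanout` references to the level below it.
    """
    return len(leaf.encode("utf-8")) * fanout ** levels


# Classic shape of the attack: "lol" plus nine levels of ten references each.
size = expanded_size(levels=9, fanout=10, leaf="lol")
print(size)  # 3000000000, i.e. about 3 GB from a document of under 1 KB
```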

Our reasons for disallowing reference to external DTDs were similar to
those given above for the internal subset.  In addition, we felt that it
would not in general be appropriate to require a SOAP processor to open a
connection to the Web in order to retrieve external DTDs.

Of course, the counter argument to all this is:  XML allows internal
subsets and external subsets, lots of off-the-shelf parsers would implement
them for you, and indeed some might not report the presence of the DTD at
all.  First of all, SOAP is not the only application of XML that requires
parsers to report the presence of DTDs.  Surely an XML editor would as
well.  Indeed, there is no W3C specification for what a general purpose
processor must be, just for what XML is.  It is important to note that our
HTTP binding does go to some trouble to ensure that all messages are
XML-conformant.  You CAN parse all legal SOAP messages from our HTTP
binding with any XML processor.  If your processor doesn't report the
presence of DTDs or entity references, then you have an error checking
problem.  Get a processor that meets your needs.  Again, many high
performance SOAP implementations will have highly optimized parser
implementations tuned for SOAP...our choices are designed in part to make
such implementations practical.

Still, we are aware of the trade-off:  our decision to limit use of
constructions such as the internal subset is likely to reduce the
performance of and otherwise negatively impact implementations and
applications which would have otherwise been able to use certain general
purpose processors;  in many cases, those implementations will have to
resort to additional scanning and reporting to deal with the features that
we disallow.

Does SOAP define an XML Subset for the Rest of the World?
=========================================================

Maybe, but that certainly wasn't a goal, and there's some reason for
caution.  SOAP places other restrictions on its use of XML. For example
(again from [1]):

"SOAP messages sent by initial SOAP senders MUST NOT contain processing
instruction information items. SOAP intermediaries MUST NOT insert
processing instruction information items in SOAP messages they relay. SOAP
receivers receiving a SOAP message containing a processing instruction
information item SHOULD generate a SOAP fault with the Value of Code set to
"env:Sender". However, in the case where performance considerations make it
impractical for an intermediary to detect processing instruction
information items in a message to be relayed, the intermediary MAY leave
such processing instruction information items unchanged in the relayed
message."

This was the subject of long debate on distApp and in the working group,
and this is not the place to reopen that debate.  To give some flavor of
the reasons why PIs are a problem, consider the following SOAP fragment:

<soap:Envelope>
  <soap:Header>
      <ns1:h1> ... </ns1:h1>
      <? your pi here -- does it modify ns2:h2 below ?>
      <ns2:h2> ... </ns2:h2>
      <ns3:h3> ... </ns3:h3>
  </soap:Header>
  <soap:Body>
      ...
  </soap:Body>
</soap:Envelope>

Consider an intermediary that processes and removes ns2:h2, the second
header.  Should it also remove the PI above when relaying the message to
the next node?  The PI might well be giving information about the element
to follow, or else it might not.   If we leave it in place, does it wind up
inadvertently modifying the third header?  The point is that any feature
like PIs adds complication.  SOAP bases all of its processing and semantics
on the tree of elements.  The fact that PIs are not tied to that tree in an
architecturally robust manner makes it very hard to define simple or stable
semantics for PIs as a SOAP message flows through a system.  Furthermore,
we would have other complications in the WS stack:  should WSDL provide
rules to describe when PIs are OK and when not?  Which PIs?  With what
parameters?  Another mess.  Again, we kept it simple by ruling them out.
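
The ambiguity can be reproduced mechanically.  The sketch below parses a
well-formed variant of the fragment above with Python's xml.dom.minidom,
removes ns2:h2 as an intermediary might, and shows that the PI now sits
directly in front of ns3:h3.  The namespace URIs, the PI target "hint", and
the element_after helper are hypothetical.

```python
from xml.dom import Node, minidom

# A well-formed variant of the fragment above; namespace URIs and the PI
# target "hint" are hypothetical.
MESSAGE = """<soap:Envelope xmlns:soap="http://example.org/soap"
    xmlns:ns1="http://example.org/1" xmlns:ns2="http://example.org/2"
    xmlns:ns3="http://example.org/3">
  <soap:Header>
      <ns1:h1/>
      <?hint applies-to-next-header?>
      <ns2:h2/>
      <ns3:h3/>
  </soap:Header>
  <soap:Body/>
</soap:Envelope>"""


def element_after(node):
    """Return the name of the first element sibling following `node`."""
    sibling = node.nextSibling
    while sibling is not None and sibling.nodeType != Node.ELEMENT_NODE:
        sibling = sibling.nextSibling
    return sibling.nodeName if sibling is not None else None


doc = minidom.parseString(MESSAGE)
header = doc.getElementsByTagNameNS("http://example.org/soap", "Header")[0]
pi = next(n for n in header.childNodes
          if n.nodeType == Node.PROCESSING_INSTRUCTION_NODE)

before = element_after(pi)  # "ns2:h2"

# The intermediary processes and removes ns2:h2 but leaves the PI alone.
h2 = doc.getElementsByTagNameNS("http://example.org/2", "h2")[0]
header.removeChild(h2)

after = element_after(pi)   # "ns3:h3"
```

Nothing in the tree of elements says which of the two readings the author
of the PI intended.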

Summary
=======

SOAP uses XML Infosets and serializations to build a framework for
messaging.  By definition, SOAP envelope Infosets do not contain DTDs or
entity references, and external DTDs are disallowed as well.  SOAP uses
pluggable bindings to move messages on the wire;  those bindings have
complete discretion as to how to represent the data.  Some might try to
play games using DOCTYPEs and DTDs on the wire, but our standard HTTP
binding does not, and it is unlikely that others would.

Few XML applications use all the features of XML (some don't use
attributes), but clearly SOAP eschews some features such as DTDs and PIs
that are often viewed as relatively general purpose.  This note sets out
some of our reasons.  All SOAP messages are conformant XML Infosets.  All
messages sent by our HTTP binding are conformant XML 1.0 and can if desired
be processed with conformant processors.  Like an XML editor, SOAP depends
on knowing whether DTDs and PIs are in its XML (in our case, though, only
for error checking).  SOAP messages also tend to be processable at
relatively high speed by carefully tuned processors.  Furthermore, by
prohibiting some of these features, we simplified the definition of the
SOAP processing model and of description languages used with SOAP.  The
tradeoff is that we have somewhat complicated things for those who prefer
to use certain off-the-shelf processors, and for those who want to insert
arbitrary XML into SOAP messages (there are many other problems doing
that...a longer story than we have time for here.)

Whether SOAP represents a good start on a general purpose subset of XML is
not a question the XMLP group has actively considered.  That was not a
goal.  We consider SOAP to be an application of XML, not a redefinition of
it.  We do hope the analysis above is useful to those who are indeed
thinking about XML subsets, and that it clarifies the reasons for our
decisions.

Noah Mendelsohn
- for the XML Protocols WG -

P.S.  Although it played no role that I am aware of in the actual decision
making of the XMLP team, I'm indebted to Rich Salz for pointing out that
the internet draft on "Guidelines for the Use of XML within IETF Protocols"
[3] has some useful perspectives on related issues.


[1] http://www.w3.org/2000/xp/Group/2/11/08/soap12-part1.html#soapenv
[2] http://www.w3.org/TR/xml-infoset/#intro.synthetic
[3] http://www.ietf.org/internet-drafts/draft-hollenbeck-ietf-xml-guidelines-07.txt

===========End of Draft=============

------------------------------------------------------------------
Noah Mendelsohn                              Voice: 1-617-693-4036
IBM Corporation                                Fax: 1-617-693-8676
One Rogers Street
Cambridge, MA 02142
------------------------------------------------------------------
Received on Friday, 6 December 2002 15:33:53 UTC