RE: XMLP WG Response on "SOAP and the Internal Subset" from David Orchard on 2002-12-09 (www-tag@w3.org from December 2002)

From: David Orchard <dorchard@bea.com>
Date: Mon, 9 Dec 2002 10:17:54 -0800
To: "'David Fallside'" <fallside@us.ibm.com>, <www-tag@w3.org>, <pgrosso@arbortext.com>
Message-ID: <002001c29faf$4bc80c90$a20ba8c0@beasys.com>
Thank you very much David and XMLP team.  This is excellent.

Cheers,
Dave

> -----Original Message-----
> From: www-tag-request@w3.org [mailto:www-tag-request@w3.org]
> On Behalf Of David Fallside
> Sent: Monday, December 09, 2002 10:02 AM
> To: www-tag@w3.org; pgrosso@arbortext.com
> Subject: XMLP WG Response on "SOAP and the Internal Subset"
> 
> The XML Protocols Workgroup appreciates this opportunity to clarify our
> design decisions regarding use of XML features such as the Internal Subset
> (for those not familiar with the term, "Internal Subset" is the official
> term for a DTD that appears within an XML document).
> 
> Background
> ==========
> 
> Before discussing our (lack of) use of DTDs, it's helpful to briefly
> clarify what SOAP is, as well as to review some use cases that influenced
> our decision making.  The following are not necessarily official use cases
> and requirements, but they are representative of the considerations that
> many implementers regarded as important:
> 
> Informally, SOAP is a specification that describes certain aspects of the
> creation, transmission, and processing of messages.  SOAP messages
> originate at a node called the "initial sender", flow along a message path
> through zero or more "intermediary" nodes, eventually reaching (in the
> absence of errors) an "ultimate receiver".  SOAP sets out the rules for
> initial construction of a message, rules by which messages are processed
> when received at an intermediary or ultimate destination, and rules by
> which portions of the message can be inserted, deleted or modified by the
> actions of an intermediary.  Thus, SOAP deals not just with the messages
> transiting a given hop, but with the manipulation of those messages as they
> go through successive intermediaries.
> 
> SOAP is a framework that's intended to be usable for a broad range of
> applications, on a variety of devices, and in a broad range of performance
> regimes.  Among the goals is for SOAP to be usable as a replacement for
> certain high performance binary protocols such as EDI, at least in certain
> applications.  Accordingly, the ability to run in a performance regime of
> hundreds or thousands of messages per second per node is highly desirable.
> 
> SOAP is designed to be hostable on a variety of so-called underlying
> protocols.  A binding to HTTP is provided and we expect that it will be
> widely deployed, but the specification provides the mechanisms necessary
> for users (or the W3C) to create bindings to other protocols, or to create
> alternative bindings to HTTP.
> 
> Message Infosets and Bindings
> =============================
> 
> SOAP messages are specified as XML Infosets -- see [1].  (Note: references
> are to a snapshot of the latest editors' copies of the SOAP specs,
> reflecting some resolutions to last call issues.  I believe the version I
> am referencing is the latest that is stable in W3C URI "date space".  It is
> later than our last official WD.)  The initial sender prepares a SOAP
> message in the form of what the Infoset Recommendation calls a "synthetic
> infoset" [2].  In other words, the initial sender typically does not have a
> document to parse to produce the infoset; rather, the initial sender
> establishes, using programming structures of its choosing (could be
> something like DOM or SAX), the elements, attributes and other content of
> the outgoing message.
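
A rough sketch of what "synthetic infoset" could look like in practice, using
Python's standard ElementTree in place of DOM or SAX.  The namespace URI is
an assumption (it is the one from the final SOAP 1.2 Recommendation, not the
editors' drafts referenced in this message), and the payload element and the
function names are hypothetical, for illustration only.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace URI of the final SOAP 1.2 Recommendation; the
# drafts current when this thread was written used a different URI.
SOAP_ENV = "http://www.w3.org/2003/05/soap-envelope"


def build_envelope(body_text):
    """Build an envelope tree directly; no document is parsed to get it."""
    ET.register_namespace("env", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    ET.SubElement(envelope, f"{{{SOAP_ENV}}}Header")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    # Hypothetical application payload, for illustration only.
    payload = ET.SubElement(body, "{urn:example:app}echo")
    payload.text = body_text
    return envelope


def serialize(envelope):
    """Serialization happens only when a binding needs a wire format."""
    return ET.tostring(envelope, encoding="unicode")
```

The tree returned by build_envelope() is the message; serialize() is only one
possible way for a binding to move it, and the result carries no DOCTYPE.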
> 
> The purpose of a binding, such as the HTTP binding, is to provide a means
> for moving the message Infoset from one node to the next.  The way in which
> the message is represented on the wire is completely at the discretion of
> the binding, and is not otherwise visible in the architecture.  The HTTP
> binding supplied with SOAP uses an XML 1.0 serialization of the Infoset.  It
> sends that serialization in an HTTP POST request or in an HTTP response,
> typically as MIME type application/soap+xml.
> 
> Note that, because SOAP is Infoset based, in a situation where two nodes
> share memory (run on the same processor or tightly coupled MP), it is
> perfectly sensible to build a binding that does its work by just passing
> around DOMs, SAX streams, or other in-memory representations of the
> Infoset.  In these cases, no serialization or parsing need ever be done.
> Also: implementations can in principle use compressed or encrypted forms,
> possibly by compressing or encrypting the <...> serialization, but also
> possibly by using other compressed or encrypted representations.  In
> principle bindings could also be written to send parts of the Infoset out
> of order, in parallel over multiple links to improve bandwidth on large
> messages, etc.
> 
> Use of DTD Internal Subsets
> ===========================
> 
> Thus, we must consider several related issues:
> 
> Q. Do DTD internal subsets or other DTD-related features appear in a SOAP
> message Infoset?
> A. By definition, they do not.  See [1], which says:
> 
>       "A SOAP message is specified as an XML Infoset
>       that consists of a document information item with
>       exactly one member in its [children] property,
>       which MUST be the SOAP Envelope element
>       information item (see 5.1 SOAP Envelope). This
>       element information item is also the value of the
>       [document element] property. The [notations] and
>       [unparsed entities] properties are both empty. The
>       [base URI], [character encoding scheme] and
>       [version] properties can have any legal value. The
>       [standalone] property either has a value of "yes"
>       or has no value.
> 
>       The XML infoset of a SOAP message MUST NOT contain
>       a document type declaration information item."
> 
> So, to the extent the Infoset recommendation is capable of reflecting the
> presence of DTDs, SOAP rules them out.  SOAP messages do not contain DTDs.
> SOAP messages also must not reference external DTDs.
> 
> Q. Can DTDs or schema validation be used to supply defaults or otherwise
> augment or alter the contents of a SOAP message?
> A. No, not insofar as such augmentation would change the results of SOAP
> processing.  SOAP makes clear that the values of all elements and
> attributes pertinent to SOAP itself must be carried explicitly in each
> message -- neither Schema nor DTD (nor any other) validation can be used to
> establish defaults for SOAP's attributes, though in certain cases SOAP
> directly defines what the behavior will be if optional attributes are left
> out.  That said, applications can do whatever they want with data received
> from SOAP bodies or header entries.  If an application chooses to infer
> information from schema validation of information received in a SOAP
> message, that is the business of the application.
> 
> Q. Can a binding use DTDs in its "on the wire" format?
> A. In principle, yes.  Somebody could write a binding that, for example,
> declares entities in an internal subset, perhaps to represent commonly
> appearing substrings, and could call for their expansion upon receipt.
> Note, however, that such use of a DTD must be completely private to the
> binding; upon receipt an Infoset must in all cases be reconstructed to be
> identical to the one provided for transmission, and by definition that does
> not contain a DTD (see above).
> 
> Q. Does the HTTP binding provided with SOAP use DTDs as described above?
> A. No.  The SOAP HTTP binding uses the obvious DTD-free serialization of
> the SOAP message Infoset.
> 
> Q. If a DTD is present and the SOAP HTTP binding is used, what does a
> receiving node do?
> A. If an implementation of the SOAP HTTP binding receives a message that
> contains a DTD, then it knows that it is talking to an erroneous
> implementation at the sender.  It SHOULD send a so-called env:Sender fault.
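
As a hedged illustration of the receiving-side check described above, here is
a minimal sketch using Python's standard expat bindings, which report a
DOCTYPE through StartDoctypeDeclHandler.  The function names are made up for
this example, the namespace URI is assumed to be that of the final SOAP 1.2
Recommendation, and the fault shown is a bare skeleton rather than anything
a real SOAP stack would necessarily emit.

```python
import xml.etree.ElementTree as ET
import xml.parsers.expat

# Assumption: namespace URI of the final SOAP 1.2 Recommendation.
SOAP_ENV = "http://www.w3.org/2003/05/soap-envelope"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


class DoctypeFound(Exception):
    """Raised from the expat handler to stop parsing at the DOCTYPE."""


def contains_doctype(data: bytes) -> bool:
    """Return True if the serialized message carries a DOCTYPE declaration."""
    parser = xml.parsers.expat.ParserCreate()

    def on_doctype(name, system_id, public_id, has_internal_subset):
        raise DoctypeFound(name)

    parser.StartDoctypeDeclHandler = on_doctype
    try:
        parser.Parse(data, True)
    except DoctypeFound:
        return True
    return False


def sender_fault(reason: str) -> ET.Element:
    """Build a skeleton fault whose Code/Value is env:Sender."""
    ET.register_namespace("env", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    fault = ET.SubElement(body, f"{{{SOAP_ENV}}}Fault")
    code = ET.SubElement(fault, f"{{{SOAP_ENV}}}Code")
    ET.SubElement(code, f"{{{SOAP_ENV}}}Value").text = "env:Sender"
    reason_el = ET.SubElement(fault, f"{{{SOAP_ENV}}}Reason")
    text = ET.SubElement(reason_el, f"{{{SOAP_ENV}}}Text", {XML_LANG: "en"})
    text.text = reason
    return envelope
```

A receiver might call contains_doctype() on the request entity and, if it
returns True, reply with sender_fault("DTD not allowed") instead of
processing the message.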
> 
> Why did we make these decisions?
> =================================
> 
> That's how SOAP works.  The question is, of course: why?  Primarily, the
> reasons are (a) performance and (b) simplicity.  In the high
> performance regimes where some SOAP implementations will operate, the
> parsers will likely be tuned for SOAP message handling.  Doing general
> entity substitution beyond that mandated by XML 1.0 (e.g. &lt;) implies a
> degree of buffer management, often data copying, etc. which can be a
> noticeable burden when going for truly high performance.  This performance
> effect has been reported by workgroup members who are building high
> performance SOAP implementations.
> 
> Furthermore, a DTD in the Infoset would become another piece of the
> message.  We would have questions to answer: what are the rules for
> relaying through an intermediary?  If something comes into an intermediary
> as an entity reference, must it go out as an entity reference?  If that
> header is removed by the intermediary, must one check whether it is the
> last use of the entity and should the outbound DTD have the definition
> removed?  What does all this do to digital signatures?  If we allowed an
> internal subset, should we change our rules to allow attributes to be
> defaulted?  All of this is complication.  So, in addition to performance,
> leaving out DTDs keeps things simpler, which by the way tends to avoid
> other performance problems.
> 
> Security is another concern.  Although we have not formally demonstrated
> that XML with internal subset is less secure, several members of the
> workgroup shared an intuition that entity substitution, attribute
> defaulting, and other manipulation of the message content were more likely
> to lead to security exposures, denial of service attacks (e.g. the billion
> laughs entity attack), etc.
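
To put a number on the billion laughs concern, here is a small arithmetic
sketch.  It assumes the commonly cited form of the attack: a 3-character base
entity and nine further entities, each consisting of ten references to the
previous one.  Those parameters and the function name are assumptions for
illustration; the message above does not spell them out.

```python
def expanded_length(base_len: int, refs_per_level: int, levels: int) -> int:
    """Length of the text produced by fully expanding the outermost entity.

    Each level multiplies the size by refs_per_level, so the growth is
    exponential in the number of nested entity declarations.
    """
    length = base_len
    for _ in range(levels):
        length *= refs_per_level
    return length


# A few hundred bytes of declarations: "lol", then 9 levels of 10 references.
total_chars = expanded_length(base_len=3, refs_per_level=10, levels=9)
```

Under those assumptions total_chars is 3 * 10**9, i.e. a billion copies of
the base string from a document well under a kilobyte in size.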
> 
> Our reasons for disallowing reference to external DTDs were similar to
> those given above for the internal subset.  In addition, we felt that it
> would not in general be appropriate to require a SOAP processor to open a
> connection to the Web in order to retrieve external DTDs.
> 
> Of course, the counterargument to all this is: XML allows internal
> subsets and external subsets, lots of off-the-shelf parsers would implement
> them for you, and indeed some might not report the presence of the DTD at
> all.  First of all, SOAP is not the only application of XML that requires
> parsers to report the presence of DTDs.  Surely an XML editor would as
> well.  Indeed, there is no W3C specification for what a general purpose
> processor must be, just for what XML is.  It is important to note that our
> HTTP binding does go to some trouble to ensure that all messages are
> XML-conformant.  You CAN parse all legal SOAP messages from our HTTP
> binding with any XML processor.  If your processor doesn't report the
> presence of DTDs or entity references, then you have an error checking
> problem.  Get a processor that meets your needs.  Again, many high
> performance SOAP implementations will have highly optimized parser
> implementations tuned for SOAP... our choices are designed in part to make
> such implementations practical.
> 
> Still, we are aware of the trade-off: our decision to limit use of
> constructions such as the internal subset is likely to reduce the
> performance of and otherwise negatively impact implementations and
> applications which would have otherwise been able to use certain general
> purpose processors; in many cases, those implementations will have to
> resort to additional scanning and reporting to deal with the features that
> we disallow.
> 
> Does SOAP define an XML Subset for the Rest of the World?
> =========================================================
> 
> Maybe, but that certainly wasn't a goal, and there's some reason for
> caution.  SOAP places other restrictions on its use of XML.  For example
> (again from [1]):
> 
> "SOAP messages sent by initial SOAP senders MUST NOT contain processing
> instruction information items. SOAP intermediaries MUST NOT insert
> processing instruction information items in SOAP messages they relay. SOAP
> receivers receiving a SOAP message containing a processing instruction
> information item SHOULD generate a SOAP fault with the Value of Code set to
> "env:Sender". However, in the case where performance considerations make it
> impractical for an intermediary to detect processing instruction
> information items in a message to be relayed, the intermediary MAY leave
> such processing instruction information items unchanged in the relayed
> message."
> 
> This was the subject of long debate on distApp and in the working group,
> and this is not the place to reopen that debate.  To give some flavor of
> the reasons why PIs are a problem, consider the following SOAP fragment:
> 
> <soap:Envelope>
>   <soap:Header>
>       <ns1:h1> ... </ns1:h1>
>       <? your pi here -- does it modify ns2:h2 below ?>
>       <ns2:h2> ... </ns2:h2>
>       <ns3:h3> ... </ns3:h3>
>   </soap:Header>
>   <soap:Body>
>       ...
>   </soap:Body>
> </soap:Envelope>
> 
> Consider an intermediary that processes and removes ns2:h2, the second
> header.  Should it also remove the PI above when relaying the message to
> the next node?  The PI might well be giving information about the element
> to follow, or else it might not.  If we leave it in place, does it wind up
> inadvertently modifying the third header?  The point is that any feature
> like PIs adds complication.  SOAP bases all of its processing and semantics
> on the tree of elements.  The fact that PIs are not tied to that tree in an
> architecturally robust manner makes it very hard to define simple or stable
> semantics for PIs as a SOAP message flows through a system.  Furthermore,
> we would have other complications in the WS stack: should WSDL provide
> rules to describe when PIs are OK and when not?  Which PIs?  With what
> parameters?  Another mess.  Again, we kept it simple by ruling them out.
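
For what it's worth, detecting PIs on the receiving side can be cheap with an
event-based parser.  A minimal sketch, assuming Python's standard expat
bindings; the function name and the sample message are hypothetical.

```python
import xml.parsers.expat


def find_processing_instructions(data: bytes):
    """Return (target, data) pairs for every PI in a serialized message.

    The XML declaration is not a processing instruction and is not
    reported through this handler.
    """
    found = []
    parser = xml.parsers.expat.ParserCreate()
    parser.ProcessingInstructionHandler = (
        lambda target, pi_data: found.append((target, pi_data))
    )
    parser.Parse(data, True)
    return found


# Hypothetical message with a PI sitting between two header blocks.
message = b"<e><h1/><?hint applies-to-next?><h2/></e>"
pis = find_processing_instructions(message)
```

A receiver that gets a non-empty list back would be in the situation the
quoted spec text describes and could generate an env:Sender fault.  Note that
nothing in the events ties the PI to h1 or h2, which is the ambiguity
discussed above.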
> 
> Summary
> =======
> 
> SOAP uses XML Infosets and serializations to build a framework for
> messaging.  By definition, SOAP envelope Infosets do not contain DTDs or
> entity references, and external DTDs are disallowed as well.  SOAP uses
> pluggable bindings to move messages on the wire; those bindings have
> complete discretion as to how to represent the data.  Some might try to
> play games using DOCTYPEs and DTDs on the wire, but our standard HTTP
> binding does not, and it's unlikely that others would.
> 
> Few XML applications use all the features of XML (some don't use
> attributes), but clearly SOAP eschews some features such as DTDs and PIs
> that are often viewed as relatively general purpose.  This note sets out
> some of our reasons.  All SOAP messages are conformant XML Infosets.  All
> messages sent by our HTTP binding are conformant XML 1.0 and can if desired
> be processed with conformant processors.  Like an XML editor, SOAP depends
> on knowing whether DTDs and PIs are in its XML (in our case, though, only
> for error checking).  SOAP messages also tend to be processable at
> relatively high speed by carefully tuned processors.  Furthermore, by
> prohibiting some of these features, we simplified the definition of the
> SOAP processing model and of description languages used with SOAP.  The
> tradeoff is that we have somewhat complicated things for those who prefer
> to use certain off-the-shelf processors, and for those who want to insert
> arbitrary XML into SOAP messages (there are many other problems doing
> that... a longer story than we have time for here).
> 
> Whether SOAP represents a good start on a general purpose subset of XML is
> not a question the XMLP group has actively considered.  That was not a
> goal.  We consider SOAP to be an application of XML, not a redefinition of
> it.  We do hope the analysis above is useful to those who are indeed
> thinking about XML subsets, and that it clarifies the reasons for our
> decisions.
> 
> 
> [1] http://www.w3.org/2000/xp/Group/2/11/08/soap12-part1.html#soapenv
> [2] http://www.w3.org/TR/xml-infoset/#intro.synthetic
> [3] http://www.ietf.org/internet-drafts/draft-hollenbeck-ietf-xml-guidelines-07.txt
Received on Monday, 9 December 2002 13:19:28 UTC