- From: David Orchard <dorchard@bea.com>
- Date: Mon, 9 Dec 2002 10:17:54 -0800
- To: "'David Fallside'" <fallside@us.ibm.com>, <www-tag@w3.org>, <pgrosso@arbortext.com>
Thank you very much David and XMLP team.  This is excellent.

Cheers,
Dave

> -----Original Message-----
> From: www-tag-request@w3.org [mailto:www-tag-request@w3.org] On Behalf Of
> David Fallside
> Sent: Monday, December 09, 2002 10:02 AM
> To: www-tag@w3.org; pgrosso@arbortext.com
> Subject: XMLP WG Response on "SOAP and the Internal Subset"
>
> The XML Protocols Workgroup appreciates this opportunity to clarify our
> design decisions regarding use of XML features such as the Internal Subset
> (for those not familiar with the term, "Internal Subset" is the official
> term for a DTD that appears within an XML document).
>
> Background
> ==========
>
> Before discussing our (lack of) use of DTDs, it's helpful to briefly
> clarify what SOAP is, as well as to review some use cases that influenced
> our decision making. The following are not necessarily official use cases
> and requirements, but they are representative of the considerations that
> many implementers considered important:
>
> Informally, SOAP is a specification that describes certain aspects of the
> creation, transmission, and processing of messages. SOAP messages
> originate at a node called the "initial sender", flow along a message path
> through zero or more "intermediary" nodes, eventually reaching (in the
> absence of errors) an "ultimate receiver". SOAP sets out the rules for
> initial construction of a message, rules by which messages are processed
> when received at an intermediary or ultimate destination, and rules by
> which portions of the message can be inserted, deleted or modified by the
> actions of an intermediary. Thus, SOAP deals not just with the messages
> transiting a given hop, but with the manipulation of those messages as they
> go through successive intermediaries.
>
> SOAP is a framework that's intended to be useable for a broad range of
> applications, on a variety of devices, and in a broad range of performance
> regimes.
> Among the goals is for SOAP to be useable as a replacement for
> certain high performance binary protocols such as EDI, at least in certain
> applications. Accordingly, the ability to run in a performance regime of
> hundreds or thousands of messages per second per node is highly desirable.
>
> SOAP is designed to be hostable on a variety of so-called underlying
> protocols. A binding to HTTP is provided and we expect that it will be
> widely deployed, but the specification provides the mechanisms necessary
> for users (or the W3C) to create bindings to other protocols, or to create
> alternative bindings to HTTP.
>
> Message Infosets and Bindings
> =============================
>
> SOAP messages are specified as XML Infosets -- see [1] (Note, references
> are to a snapshot of the latest editors' copies of the SOAP specs,
> reflecting some resolutions to last call issues. I believe the version I
> am referencing is the latest that is stable in W3C URI "date space". It is
> later than our last official WD.) The initial sender prepares a SOAP
> message in the form of what the Infoset Recommendation calls a "synthetic
> infoset" [2]. In other words, the initial sender typically does not have a
> document to parse to produce the infoset; rather, the initial sender
> establishes, using programming structures of its choosing (could be
> something like DOM or SAX) the elements, attributes and other content of
> the outgoing message.
>
> The purpose of a binding, such as the HTTP binding, is to provide a means
> for moving the message Infoset from one node to the next. The way in which
> the message is represented on the wire is completely at the discretion of
> the binding, and is not otherwise visible in the architecture. The HTTP
> binding supplied with SOAP uses an XML 1.0 serialization of the Infoset. It
> sends that serialization in an HTTP POST or RESPONSE, typically as MIME
> type application/soap+xml.
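
As a rough illustration of the two steps just described (build the message
as an in-memory tree, then let the binding serialize it), here is a minimal
Python sketch. It is not taken from the SOAP specs. The envelope namespace
is the one from the final SOAP 1.2 Recommendation, which postdates this
message, and the payload element and the function names are hypothetical.
Nothing is actually sent; the sketch only prepares the headers and body a
POST would carry.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace of the final SOAP 1.2 Recommendation.
SOAP_ENV = "http://www.w3.org/2003/05/soap-envelope"


def build_envelope(body_text):
    """Build the message as an in-memory tree (a "synthetic infoset")."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    # Hypothetical application payload, for illustration only.
    payload = ET.SubElement(body, "{urn:example:app}echo")
    payload.text = body_text
    return envelope


def serialize_for_http(envelope):
    """Return (headers, body bytes) for an HTTP POST; nothing is sent."""
    # XML 1.0 serialization with no DOCTYPE: ElementTree never emits one.
    data = ET.tostring(envelope, encoding="utf-8", xml_declaration=True)
    headers = {
        "Content-Type": "application/soap+xml; charset=utf-8",
        "Content-Length": str(len(data)),
    }
    return headers, data


headers, data = serialize_for_http(build_envelope("hello"))
```

The tree exists before any bytes do, which is the sense in which the
infoset is "synthetic": the serialization is produced from it, not parsed
into it.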
>
> Note that, because SOAP is Infoset based, in a situation where two nodes
> share a memory (run on the same processor or tightly coupled MP), it is
> perfectly sensible to build a binding that does its work by just passing
> around DOMs, SAX streams, or other in-memory representations of the
> Infoset. In these cases, no serialization or parsing need ever be done.
> Also: implementations can in principle use compressed or encrypted forms,
> possibly by compressing or encrypting the <...> serialization, but also
> possibly by using other compressed or encrypted representations. In
> principle bindings could also be written to send parts of the Infoset out
> of order, in parallel over multiple links to improve bandwidth on large
> messages, etc.
>
> Use of DTD Internal Subsets
> ===========================
>
> Thus, we must consider several related issues:
>
> Q. Do DTD internal subsets or other DTD-related features appear in a SOAP
> message Infoset?
> A. By definition, they do not. See [1], which says:
>
>      "A SOAP message is specified as an XML Infoset
>      that consists of a document information item with
>      exactly one member in its [children] property,
>      which MUST be the SOAP Envelope element
>      information item (see 5.1 SOAP Envelope). This
>      element information item is also the value of the
>      [document element] property. The [notations] and
>      [unparsed entities] properties are both empty. The
>      [base URI], [character encoding scheme] and
>      [version] properties can have any legal value. The
>      [standalone] property either has a value of "yes"
>      or has no value.
>
>      The XML infoset of a SOAP message MUST NOT contain
>      a document type declaration information item."
>
> So, to the extent the Infoset recommendation is capable of reflecting the
> presence of DTDs, SOAP rules them out. SOAP messages do not contain DTDs.
> SOAP messages also must not reference external DTDs.
>
> Q.
> Can DTDs or schema validation be used to supply defaults or otherwise
> augment or alter the contents of a SOAP message?
> A. No, not insofar as such augmentation would change the results of SOAP
> processing. SOAP makes clear that the values of all elements and
> attributes pertinent to SOAP itself must be carried explicitly in each
> message -- neither Schema nor DTD (nor any other) validation can be used to
> establish defaults for SOAP's attributes, though in certain cases SOAP
> directly defines what the behavior will be if optional attributes are left
> out. That said, applications can do whatever they want with data received
> from SOAP bodies or header entries. If an application chooses to infer
> information from schema validation of information received in a SOAP
> message, that is the business of the application.
>
> Q. Can a binding use DTDs in its "on the wire" format?
> A. In principle, yes. Somebody could write a binding that, for example,
> declares entities in an internal subset, perhaps to represent commonly
> appearing substrings, and could call for their expansion upon receipt.
> Note, however, that such use of a DTD must be completely private to the
> binding; upon receipt an Infoset must in all cases be reconstructed to be
> identical to the one provided for transmission, and by definition that does
> not contain a DTD (see above).
>
> Q. Does the HTTP binding provided with SOAP use DTDs as described above?
> A. No. The SOAP HTTP binding uses the obvious no-DTD serialization of the
> SOAP message Infoset.
>
> Q. If a DTD is present and the SOAP HTTP binding is used, what does a
> receiving node do?
> A. If an implementation of the SOAP HTTP binding receives a message that
> contains a DTD, then it knows that it is talking to an erroneous
> implementation at the sender. It SHOULD send a so-called env:Sender fault.
>
> Why did we make these decisions?
> =================================
>
> That's how SOAP works.
> The question is, of course: why? Primarily, the
> reasons are (a) performance and (b) keep it simple. In the high
> performance regimes where some SOAP implementations will operate, the
> parsers will likely be tuned for SOAP message handling. Doing general
> entity substitution beyond that mandated by XML 1.0 (e.g. &lt;) implies a
> degree of buffer management, often data copying, etc. which can be a
> noticeable burden when going for truly high performance. This performance
> effect has been reported by workgroup members who are building high
> performance SOAP implementations.
>
> Furthermore, a DTD in the Infoset would become another piece of the
> message. We would have questions to answer: what are the rules for
> relaying through an intermediary? If something comes into an intermediary
> as an entity reference, must it go out as an entity reference? If that
> header is removed by the intermediary, must one check whether it is the
> last use of the entity and should the outbound DTD have the definition
> removed? What does all this do to digital signatures? If we allowed an
> internal subset, should we change our rules to allow attributes to be
> defaulted? All of this is complication. So, in addition to performance,
> leaving out DTDs keeps things simpler, which by the way tends to avoid
> other performance problems.
>
> Security is another concern. Although we have not formally demonstrated
> that XML with internal subset is less secure, several members of the
> workgroup shared an intuition that entity substitution, attribute
> defaulting, and other manipulation of the message content was more likely
> to lead to security exposures, denial of service attacks (e.g. the billion
> laughs entity attack), etc.
>
> Our reasons for disallowing reference to external DTDs were similar to
> those given above for the internal subset.
> In addition, we felt that it
> would not in general be appropriate to require a SOAP processor to open a
> connection to the Web in order to retrieve external DTDs.
>
> Of course, the counter argument to all this is: XML allows internal
> subsets and external subsets, lots of off the shelf parsers would implement
> them for you, and indeed some might not report the presence of the DTD at
> all. First of all, SOAP is not the only application of XML that requires
> parsers to report the presence of DTDs. Surely an XML editor would as
> well. Indeed, there is no W3C specification for what a general purpose
> processor must be, just for what XML is. It is important to note that our
> HTTP binding does go to some trouble to ensure that all messages are
> XML-conformant. You CAN parse all legal SOAP messages from our HTTP
> binding with any XML processor. If your processor doesn't report the
> presence of DTDs or entity references, then you have an error checking
> problem. Get a processor that meets your needs. Again, many high
> performance SOAP implementations will have highly optimized parser
> implementations tuned for SOAP...our choices are designed in part to make
> such implementations practical.
>
> Still, we are aware of the trade-off: our decision to limit use of
> constructions such as the internal subset is likely to reduce the
> performance of and otherwise negatively impact implementations and
> applications which would have otherwise been able to use certain general
> purpose processors; in many cases, those implementations will have to
> resort to additional scanning and reporting to deal with the features that
> we disallow.
>
> Does SOAP define an XML Subset for the Rest of the World?
> =========================================================
>
> Maybe, but that certainly wasn't a goal, and there's some reason for
> caution. SOAP places other restrictions on its use of XML.
> For example (again from [1]):
>
>      "SOAP messages sent by initial SOAP senders MUST NOT contain
>      processing instruction information items. SOAP intermediaries MUST
>      NOT insert processing instruction information items in SOAP messages
>      they relay. SOAP receivers receiving a SOAP message containing a
>      processing instruction information item SHOULD generate a SOAP fault
>      with the Value of Code set to "env:Sender". However, in the case
>      where performance considerations make it impractical for an
>      intermediary to detect processing instruction information items in a
>      message to be relayed, the intermediary MAY leave such processing
>      instruction information items unchanged in the relayed message."
>
> This was the subject of long debate on distApp and in the working group,
> and this is not the place to reopen that debate. To give some flavor of
> the reasons why PIs are a problem, consider the following SOAP fragment:
>
>   <soap:Envelope>
>     <soap:Header>
>       <ns1:h1> ... </ns1:h1>
>       <? your pi here -- does it modify ns2:h2 below ?>
>       <ns2:h2> ... </ns2:h2>
>       <ns3:h3> ... </ns3:h3>
>     </soap:Header>
>     <soap:Body>
>       ...
>     </soap:Body>
>   </soap:Envelope>
>
> Consider an intermediary that processes and removes ns2:h2, the second
> header. Should it also remove the PI above when relaying the message to
> the next node? The PI might well be giving information about the element
> to follow, or else it might not. If we leave it in place, does it wind up
> inadvertently modifying the third header? The point is that any feature
> like PIs adds complication. SOAP bases all of its processing and semantics
> on the tree of elements. The fact that PIs are not tied to that tree in an
> architecturally robust manner makes it very hard to define simple or stable
> semantics for PIs as a SOAP message flows through a system.
> Furthermore,
> we would have other complications in the WS stack: should WSDL provide
> rules to describe when PIs are OK and when not? Which PIs? With what
> parameters? Another mess. Again, we kept it simple by ruling them out.
>
> Summary
> =======
>
> SOAP uses XML Infosets and serializations to build a framework for
> messaging. By definition, SOAP envelope Infosets do not contain DTDs or
> entity references, and external DTDs are disallowed as well. SOAP uses
> pluggable bindings to move messages on the wire; those bindings have
> complete discretion as to how to represent the data. Some might try to
> play games using DOCTYPEs and DTDs on the wire, but our standard HTTP
> binding does not, and it's probably unlikely that others would.
>
> Few XML applications use all the features of XML (some don't use
> attributes), but clearly SOAP eschews some features such as DTDs and PIs
> that are often viewed as relatively general purpose. This note sets out
> some of our reasons. All SOAP messages are conformant XML Infosets. All
> messages sent by our HTTP binding are conformant XML 1.0 and can if desired
> be processed with conformant processors. Like an XML editor, SOAP depends
> on knowing whether DTDs and PIs are in its XML (in our case, though, only
> for error checking.) SOAP messages also tend to be processable at
> relatively high speed by carefully tuned processors. Furthermore, by
> prohibiting some of these features, we simplified the definition of the
> SOAP processing model and of description languages used with SOAP. The
> tradeoff is that we have somewhat complicated things for those who prefer
> to use certain off-the-shelf processors, and for those who want to insert
> arbitrary XML into SOAP messages (there are many other problems doing
> that...a longer story than we have time for here.)
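
The error-checking point (a receiver needs a parser that reports DTDs and
PIs) can be sketched with a general purpose parser that does expose them.
The following is only an illustration, not anything defined by the SOAP
specs: it uses the expat bindings in the Python standard library, and the
function name find_disallowed and its return format are hypothetical. A
real receiver would go on to generate an env:Sender fault when the list is
non-empty.

```python
import xml.parsers.expat


def find_disallowed(message):
    """List the DTD and PI constructs seen in a serialized message."""
    found = []
    parser = xml.parsers.expat.ParserCreate()

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # Any document type declaration is disallowed, internal or external.
        found.append("doctype")

    def on_pi(target, data):
        # The XML declaration is not reported here; only real PIs are.
        found.append("pi:" + target)

    parser.StartDoctypeDeclHandler = on_doctype
    parser.ProcessingInstructionHandler = on_pi
    parser.Parse(message, True)
    return found


print(find_disallowed(b"<x><?target data?></x>"))
```

This sketch parses the whole message before reporting. An implementation
worried about entity expansion attacks would more likely stop at the first
document type declaration instead of continuing.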
>
> Whether SOAP represents a good start on a general purpose subset of XML is
> not a question the XMLP group has actively considered. That was not a
> goal. We consider SOAP to be an application of XML, not a redefinition of
> it. We do hope the analysis above is useful to those who are indeed
> thinking about XML subsets, and that it clarifies the reasons for our
> decisions.
>
> [1] http://www.w3.org/2000/xp/Group/2/11/08/soap12-part1.html#soapenv
> [2] http://www.w3.org/TR/xml-infoset/#intro.synthetic
> [3] http://www.ietf.org/internet-drafts/draft-hollenbeck-ietf-xml-guidelines-07.txt
Received on Monday, 9 December 2002 13:19:28 UTC