re: XMLP WG Response on "SOAP and the Internal Subset"

In reply to:

http://lists.w3.org/Archives/Public/www-tag/2002Dec/0119.html 

Since we spent some time on this topic when discussing the "XML
Guidelines" document (http://www.imc.org/ietf-xml-use), I thought I
would respond to the points in the XMLP WG response.

I am not opposed to the XMLP WG disallowing an "Internal Subset" in
SOAP messages: I just think it would be good to be clearer about the
justification. Also, if these are good syntactic restrictions for
XMLP, then they are likely to be good syntactic restrictions for
other XML applications in protocols. Our concern wasn't that "there
should be no subsets" so much as that "there should not be widely
varying subsets". So, if XMLP has good reason for its syntactic
restrictions, those restrictions should be documented separately, so
that a separate class of XML processors that generate and consume the
restricted class of XML documents could be more widely supported, not
just in SOAP processors.

I understand that the XMLP constraints only apply to the HTTP binding
of SOAP, and that it is possible to define different bindings with
different properties.

> ....... Doing general entity substitution beyond that mandated by
> XML 1.0 (e.g. &lt;) implies a degree of buffer management, often
> data copying, etc. which can be a noticeable burden when going for
> truly high performance.  This performance effect has been reported
> by workgroup members who are building high performance SOAP
> implementations.

We couldn't find a first-hand account of such performance effects on
implementations that allow entity substitution, in cases where no
entity substitution is actually used; if you have such accounts, it
would be very helpful if you could share them. It is easy to imagine
issues of code footprint, but harder to see an actual performance
impact when no entity definitions are present. Did any of these
reports of 'performance effects' include details?
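
For concreteness, here is a minimal illustration of the construct at
issue (my own example, not taken from any WG report). Any conforming
XML 1.0 processor, such as the expat-based parser in Python's standard
library, must perform the substitution:

    import xml.etree.ElementTree as ET

    # An internal subset declaring a general entity; a conforming
    # processor must substitute &greeting; in the element content.
    message = b"""<!DOCTYPE env [
      <!ENTITY greeting "hello, world">
    ]>
    <env><body>&greeting;</body></env>"""

    root = ET.fromstring(message)
    print(root.find("body").text)   # -> hello, world

The question is whether supporting this costs anything when the
message, as in the vast majority of SOAP traffic, contains no entity
declarations at all.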

> Furthermore, a DTD in the Infoset would become another piece of the
> message.  We would have questions to answer: what are the rules for
> relaying through an intermediary?

It would be useful to define XMLP in terms of the 'canonical
Infoset': the Infoset of the RFC 3076 Canonical XML form of the
document. In particular, Canonical XML expands all entities and
removes the DTD.
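
As a rough sketch of what that buys (using lxml's RFC 3076 serializer;
the choice of library is mine, and any Canonical XML implementation
should do), two messages that differ only in their use of the internal
subset reduce to the same canonical form, and hence the same Infoset:

    from lxml import etree

    with_dtd = b'<!DOCTYPE env [ <!ENTITY g "hello"> ]><env><body>&g;</body></env>'
    plain = b'<env><body>hello</body></env>'

    def c14n(data):
        # Entities are expanded at parse time, and the c14n output
        # carries no DOCTYPE, so the internal subset leaves no trace.
        return etree.tostring(etree.fromstring(data), method="c14n")

    assert c14n(with_dtd) == c14n(plain)
    print(c14n(with_dtd))   # -> b'<env><body>hello</body></env>'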

> what are the rules for relaying through an intermediary?  If
> something comes into an intermediary as an entity reference, must it
> go out as an entity reference? If that header is removed by the
> intermediary, must one check whether it is the last use of the entity
> and should the outbound DTD have the definition removed?

If intermediary processing is defined in terms of the canonical
Infoset and not the concrete syntax, then there is no requirement that
entity references be preserved. Conversely, an intermediary would even
be free to introduce an Internal Subset DTD of its own, e.g., when
forwarding messages.

> What does all this do to digital signatures?  If we allowed an
> internal subset, should we change our rules to allow attributes to be
> defaulted? 

Digital signatures are defined in terms of Canonical XML, which has
all entities resolved and the DTD removed.

> All of this is complication. 

It would seem less complicated to say "XML" than "XML except no
Internal Subset DTDs". The complexity is in making exceptions.

> Security is another concern.  Although we have not formally
> demonstrated that XML with internal subset is less secure, several
> members of the workgroup shared an intuition that entity
> substitution, attribute defaulting, and other manipulation of the
> message content was more likely to lead to security exposures,
> denial of service attacks (e.g. the billion laughs entity attack),
> etc.

Any message from any unauthenticated source introduces the potential
for a denial of service attack, merely from the possibilities of
overly long URI paths, element names, attribute values, content,
etc. When parsing any message from an unauthenticated source, it's
necessary to ensure that parsing the message doesn't consume undue
resources in the receiver.  The parsing and substitution of entity
definitions is just one of many such considerations. It isn't much
more work to ensure that there isn't a billion laughs than it is to
ensure that a URL with a billion hahas isn't being consumed.
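
To make the comparison concrete, here is one sketch of such a guard
(my own, and only one of several reasonable policies): a receiver that
refuses entity declarations outright stops the 'billion laughs'
message at the same place in the code where it would already be
enforcing limits on name and value lengths.

    import xml.parsers.expat

    # The classic expansion bomb, truncated to two levels for brevity.
    bomb = b"""<!DOCTYPE lolz [
      <!ENTITY lol "lol">
      <!ENTITY lol2 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
    ]>
    <lolz>&lol2;</lolz>"""

    def refuse(name, *rest):
        # Reject any entity declaration before expansion can begin.
        raise ValueError("entity declaration refused: %s" % name)

    parser = xml.parsers.expat.ParserCreate()
    parser.EntityDeclHandler = refuse
    try:
        parser.Parse(bomb, True)
    except ValueError as exc:
        print(exc)   # -> entity declaration refused: lol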

> Our reasons for disallowing reference to external DTDs were similar
> to those given above for the internal subset.  In addition, we felt
> that it would not in general be appropriate to require a SOAP
> processor to open a connection to the Web in order to retrieve
> external DTDs.

I think external DTDs are a different consideration: requiring a
receiver to retrieve a document from the network before it can parse a
message is a legitimate objection, and one that does not apply to the
internal subset.

I also wonder about the justification for forbidding processing
instructions. Yes, they are ambiguous, because their scope is not
defined. But it is the nature of processing instructions to have
private semantics, and the ambiguity of scope is only one of many
difficulties.  Wouldn't it be sufficient to note that processing
instructions are unreliable, may appear in XML messages, should not be
sent, and should be ignored by receivers?  By themselves they are
harmless.
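
For what it's worth, 'ignored by receivers' is what a stock parser
already does. With Python's standard library, for instance, processing
instructions never reach the application unless it specifically asks
for them:

    import xml.etree.ElementTree as ET

    message = b"""<?private-extension ignore-me?>
    <env><body>hello</body></env>"""

    # The default tree builder discards processing instructions, so
    # tolerating them costs the receiver nothing.
    root = ET.fromstring(message)
    print(root.find("body").text)   # -> hello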

"Again, we kept it simple by ruling them out."

The philosophy of 'everything that is not mandatory is forbidden'
doesn't actually make for robust protocol design, nor does it simplify
this one. By making an exception to standard XML, you make things more
complicated. What is the complexity cost of receivers ignoring
processing instructions vs. explicitly checking for them and
disallowing them?
