Possible new protocols issue/erratum: characters in Infoset & version of Infoset from noah_mendelsohn@us.ibm.com on 2004-03-03 (xmlp-comments@w3.org from March 2004)

From: <noah_mendelsohn@us.ibm.com>
Date: Wed, 3 Mar 2004 09:09:56 -0500
To: xmlp-comments@w3.org
Cc: Richard Tobin <richard@cogsci.ed.ac.uk>
Message-ID: <OFB88F47C6.F2D497FF-ON85256E4C.004B8D9B@lotus.com>

SOAP 1.2 specifically depends on the October 24, 2001 version of
Infoset [1].  At the time, the latest published version of XML was XML 1.0
Second Edition [2].  As an aside, that version of Infoset has a seemingly
misleading reference which says:

"XML Extensible Markup Language (XML) 1.0 (Second Edition), W3C, eds. Tim
Bray, Jean Paoli, C.M. Sperberg-McQueen, Eve Maler. 6 October 2000.
Available at http://www.w3.org/TR/REC-xml. "

The supplied link now resolves to the Third Edition.

Anyway, I had always assumed that the version of Infoset to which SOAP
refers would limit character children in synthetic Infosets (typical of
SOAP) as well as others to the then-legal XML characters [3].  Richard
Tobin was kind enough to point out to me that Infoset in fact has no such
limitation on the contents of character children.

Since we don't explicitly enforce such a restriction in SOAP either, for
example in the body child element [4], we have what I take to be the bizarre
situation that SOAP envelope infosets can, per our recommendation, contain
non-XML characters.  They could, for example, contain nulls or the
XML-forbidden control characters below x20.  I don't believe this was
intentional.

Also:  I think this means that our HTTP binding, which serializes the
envelope as XML 1.0 and so cannot carry such characters, contradicts the
requirements of the binding framework [5], which states that:

"the minimum responsibility of a binding in transmitting a message is to
specify the means by which the SOAP message infoset is transferred to and
reconstituted by the binding at the receiving SOAP node and to specify the
manner in which the transmission of the envelope is effected using the
facilities of the underlying protocol."
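
As a rough illustration of the mismatch, here is a minimal Python sketch. It
assumes the standard-library xml.etree.ElementTree as a stand-in for a binding
that serializes the envelope as XML 1.0; the element name "payload" and the
function name are hypothetical and not taken from SOAP.

```python
import xml.etree.ElementTree as ET


def survives_xml10_round_trip(text):
    """Return True if `text` comes back unchanged as a character child."""
    body_child = ET.Element("payload")  # hypothetical body child element
    body_child.text = text              # synthetic tree: nothing validates the characters here
    wire_bytes = ET.tostring(body_child)  # the serializer escapes markup but does not reject control characters
    try:
        return ET.fromstring(wire_bytes).text == text
    except ET.ParseError:
        # The receiving parser rejects the document, so the infoset
        # cannot be reconstituted from what was sent.
        return False
```

With this sketch, ordinary text round-trips, while a value such as "a\x01b"
is written out by the sender but rejected by the receiving parser.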

So, our specification is self-contradictory.  I think this is good news of
a sort, as it means we can consider fixing this with an erratum.  I believe
that an example of a specific fix would be to state in the description of
the body child element [4]:

<recommendationText>
MAY have any number of character information item children. Child character
information items whose character code is amongst the white space
characters as defined by XML 1.0 [XML 1.0] are considered significant.
</recommendationText>
<proposedRevision>
MAY have any number of character information item children. >Each such
child must have as its [character code] a value which matches the {Char}
production of XML 1.0 [XML 1.0].<  Child character information items whose
character code is amongst the white space characters as defined by XML 1.0
[XML 1.0] are considered significant.
</proposedRevision>
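
The check that the proposed revision calls for can be sketched in Python as
follows. The ranges are those of the Char production in XML 1.0 [3]; the
function names are illustrative only.

```python
def is_xml10_char(code_point):
    """Return True if the code point matches the XML 1.0 Char production."""
    return (
        code_point in (0x9, 0xA, 0xD)            # tab, line feed, carriage return
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD         # excludes surrogates, xFFFE and xFFFF
        or 0x10000 <= code_point <= 0x10FFFF
    )


def invalid_character_children(text):
    """List (index, character code) for each character child that violates Char."""
    return [
        (index, ord(ch))
        for index, ch in enumerate(text)
        if not is_xml10_char(ord(ch))
    ]
```

An empty result means every character child would satisfy the revised text;
a null or a control character below x20 (other than tab, line feed and
carriage return) is reported with its position.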

Obviously we should look through all parts of the Recommendation to see if
there are other similar slip-ups.  FWIW:  most of our other character
children are actually schema typed, and therefore don't have this problem.  The
mustUnderstand attribute, for example, is a boolean.  The schema for SOAP
constrains its characters to {true, false, 0, 1}.
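
To show why a schema-typed value is already safe, here is a small Python
sketch that accepts only the four lexical forms listed above. The function
name is hypothetical, and the whitespace trimming assumes the collapse
behaviour that XML Schema specifies for boolean.

```python
# Lexical forms allowed for a schema boolean, mapped to their values.
BOOLEAN_LEXICAL_FORMS = {"true": True, "1": True, "false": False, "0": False}


def parse_must_understand(value):
    """Parse a mustUnderstand attribute value, rejecting anything else."""
    lexical = value.strip(" \t\r\n")  # assumption: whitespace is collapsed before validation
    if lexical not in BOOLEAN_LEXICAL_FORMS:
        raise ValueError("not a valid boolean lexical form: %r" % value)
    return BOOLEAN_LEXICAL_FORMS[lexical]
```

Because only these forms pass, a value containing a null or a forbidden
control character can never validate, which is why the typed attributes do
not share the problem of the untyped body children.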

Noah

[1] http://www.w3.org/TR/2001/REC-xml-infoset-20011024/
[2] http://www.w3.org/TR/2000/REC-xml-20001006
[3] http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char
[4] http://www.w3.org/TR/soap12-part1/#soapbodyel
[5] http://www.w3.org/TR/soap12-part1/#bindfw

--------------------------------------
Noah Mendelsohn
IBM Corporation
One Rogers Street
Cambridge, MA 02142
1-617-693-4036
--------------------------------------

Received on Wednesday, 3 March 2004 09:32:10 UTC