RE: Proposed text on reliability in the web services architecture [long] from Cutler, Roger (RogerCutler) on 2003-01-09 (www-ws-arch@w3.org from January 2003)

From: Cutler, Roger (RogerCutler) <RogerCutler@ChevronTexaco.com>
Date: Thu, 9 Jan 2003 14:19:05 -0600
To: "Champion, Mike" <Mike.Champion@SoftwareAG-USA.com>, www-ws-arch@w3.org
Message-ID: <7FCB5A9F010AAE419A79A54B44F3718E01817C76@bocnte2k3.boc.chevrontexaco.net>
Although obviously a bit rough, I think that this is a truly excellent starting point.  I say this having tried myself to summarize the threads and not doing anywhere near as well.

The only quibble that I would like to bring up at this time is the fact that HTTPR is sort of casually mentioned, as if everybody knows what it is and its significance, before anything is said about what it is.  I personally do not understand the role of HTTPR well enough to be comfortable with this and I think it should be introduced and categorized somehow, as are other techniques mentioned.  I am highlighting this because I have the vague feeling that HTTPR is a bit controversial, although I could easily be wrong.

-----Original Message-----
From: Champion, Mike [mailto:Mike.Champion@SoftwareAG-USA.com] 
Sent: Wednesday, January 08, 2003 6:55 PM
To: www-ws-arch@w3.org
Subject: Proposed text on reliability in the web services architecture [long]




[I have an action item to update the WSA draft with text "harvested" from the lengthy threads last month on reliability in the WSA and what we should say and do about it.  I'm not confident enough in my understanding and summary to plop this right into the document, so I offer it here for your consideration and critique.]

 
The issue of how one ensures that Web services operate reliably in environments (such as the Internet) that do not offer "industrial strength" quality of service is a multifaceted and contentious one.  Indeed, it's a densely interconnected cluster of issues. One might address it in the messaging infrastructure, adding retry and timeout logic that makes repeated attempts to move SOAP messages from the ultimate origin to their final destination and reports the status back to the sender. "Asynchrony" is part of the picture because, without a reliable substrate, senders and receivers may communicate "out of band" to ensure that messages were received. And it's tangled up with Choreography and Transactions because Web service invocations can fail for all sorts of reasons besides messages not being delivered, and some sort of recovery protocol may need to be scripted.  Finally, there is a school of thought that argues that reliable messaging is not an important feature of the messaging infrastructure if the application-level protocol is well-designed.

Let's consider the architectural issues as independently as possible. 

Reliable Messaging Layer

First, what can be done to assure that SOAP messages have been received (possibly after traversing multiple intermediaries) once and only once? 
[Note: it is important to pay very close attention to the wording of assertions about reliable messaging: there is no way to "assure" that a message will be received, and it appears to be beyond the state of the art in computer science to reliably determine that a message was NOT received, so it is misleading to say that a reliable messaging layer lets the sender determine "whether" a message was received!] Of course, one possibility is to use proprietary messaging software at the transport layer; examples would include products such as IBM WebSphere MQ and MQ Everywhere, Microsoft MSMQ, and those from a number of vendors that implement the Java Message Service API.
Another approach, on which this discussion will focus, is taken by the ebXML Messaging Service specification and a number of vendor-specific technologies: use SOAP header extensions to define a protocol for handling retries, removal of duplicates, and notification of errors.

For example, in ebXML:
"Reliability is achieved by a Receiving MSH responding to a message with an Acknowledgment Message. An Acknowledgment Message is any ebXML message containing an Acknowledgment element. Failure to receive an Acknowledgment Message by a Sending MSH MAY trigger successive retries until such time as an Acknowledgment Message is received or the predetermined number of retries has been exceeded at which time the From Party MUST be notified of the probable delivery failure. Whenever an identical message may be received more than once, some method of duplicate detection and elimination is indicated, usually through the mechanism of a persistent store." 
http://www.oasis-open.org/committees/ebxml-msg/documents/ebMS_v2_0rev_c.pdf
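
To make the quoted behaviour concrete, here is a minimal Python sketch of acknowledge, retry, and duplicate elimination. It is only an illustration under stated assumptions, not the ebXML wire format: the names Receiver, FlakyChannel, and send_reliably are hypothetical, and an in-memory set stands in for the persistent store.

```python
class Receiver:
    """Hypothetical receiving handler: acks every message, processes each id once."""

    def __init__(self):
        self.seen_ids = set()    # assumption: stands in for a persistent store
        self.processed = []

    def receive(self, msg_id, payload):
        if msg_id not in self.seen_ids:      # duplicate detection and elimination
            self.seen_ids.add(msg_id)
            self.processed.append(payload)
        return ("ack", msg_id)               # duplicates are acknowledged too


class FlakyChannel:
    """Hypothetical transport that loses the ack for the first `lost_acks` deliveries."""

    def __init__(self, receiver, lost_acks):
        self.receiver = receiver
        self.lost_acks = lost_acks
        self.deliveries = 0

    def deliver(self, msg_id, payload):
        self.deliveries += 1
        ack = self.receiver.receive(msg_id, payload)
        if self.deliveries <= self.lost_acks:
            return None                      # the message arrived, but the ack was lost
        return ack


def send_reliably(channel, msg_id, payload, max_retries=3):
    """Send once, then retry until acked or the retry budget is exhausted.

    Returns True on ack. False means only a probable delivery failure:
    the sender cannot tell a lost message from a lost ack.
    """
    for _attempt in range(1 + max_retries):
        if channel.deliver(msg_id, payload) == ("ack", msg_id):
            return True
    return False
```

Note that when every ack is lost, send_reliably reports failure even though the receiver processed the message exactly once, which is the reason for the caution above about claiming a sender can determine "whether" a message was received.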

It is worth noting that the problem of reliable messaging is very successfully addressed on the Internet itself in the TCP protocol, which sits on top of IP to ensure that message components are delivered once and only once in the correct order.  We may wish to look at the approach adopted within TCP:

* TCP uses the assumption that all messages prior to the explicitly ack'ed message are implicitly ack'ed

* TCP employs exponential back-off: unacknowledged messages are retransmitted with an exponentially growing time interval between attempts.

* TCP separates message sending from acknowledgment, yet a single packet can carry both levels of the conversation at once (data in one direction and acks for the other). This is quite different from the simple ack of ebXML: you tend to send as many acks as messages you send (as opposed to messages you receive).
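
The first two points can be sketched in a few lines of Python. This is a simplification, not TCP itself: real TCP acknowledges byte positions rather than whole messages and derives its timers from measured round-trip times. The function names and parameters here are hypothetical.

```python
def backoff_schedule(initial, factor=2.0, retries=5, cap=None):
    """Return the waits between successive retransmissions.

    Each wait is `factor` times the previous one, optionally limited by `cap`.
    """
    waits = []
    wait = initial
    for _ in range(retries):
        waits.append(wait if cap is None else min(wait, cap))
        wait *= factor
    return waits


def apply_cumulative_ack(unacked, acked_seq):
    """Treat an ack for `acked_seq` as implicitly acking everything before it.

    Returns the sequence numbers that are still awaiting acknowledgment.
    """
    return [seq for seq in unacked if seq > acked_seq]
```

For example, backoff_schedule(1.0, retries=4) gives waits of 1, 2, 4, and 8 time units, and a single ack for message 3 clears messages 1 through 3 from the pending list.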

 
But in the case of reliable messaging, it seems that you should be able to use, for example, SOAP over HTTPR on one hop, and SOAP over JMS on the next hop, and still be able to support reliable messaging end-to-end. (The message goes reliably from A to C iff it goes reliably from A to B and from B to C - for example, B waits until it gets the transport-level ack from C before sending its transport-level ack to A). 
In fact, I think this was the rationale when IBM designed HTTPR, so that you could go from Internet to intranet (and vice versa), using SOAP over HTTPR on the Internet, and then switching to SOAP over MQSeries (or other MOM) once inside the intranet. 
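
The A-to-B-to-C rule in the parenthetical can be sketched as follows. This is a toy model under stated assumptions: Endpoint and Intermediary are hypothetical names, a boolean return value stands in for a transport-level ack, and the actual transports (HTTPR, JMS, MQSeries) are not modelled at all.

```python
class Endpoint:
    """Hypothetical final receiver (C): stores the message and acks it."""

    def __init__(self, up=True):
        self.up = up
        self.inbox = []

    def handle(self, message):
        if not self.up:
            return False          # no transport-level ack comes back
        self.inbox.append(message)
        return True               # transport-level ack


class Intermediary:
    """Hypothetical hop (B): acks upstream only after downstream has acked."""

    def __init__(self, next_hop):
        self.next_hop = next_hop

    def handle(self, message):
        # B withholds its own ack to A until it has C's ack in hand.
        return self.next_hop.handle(message)
```

Because each hop only forwards the downstream result, the sender sees an ack exactly when every hop on the path produced one, whichever transport each hop happens to use.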

So, there needs to be a "box" in the Web services architecture document/diagrams to describe components [oops, we aren't supposed to use that word!  I forget what we are supposed to say ...] that manage message delivery to handle retries and error reporting to do the best job possible. 
It's not clear to me, however, whether a W3C specification of this is needed, whether the ebXML Messaging reliability layer could be referenced normatively, or what.  Thoughts?

RESTful Application-Level Protocols in lieu of RM

To some extent, the absence of a reliable messaging layer in the infrastructure can be alleviated or superseded by employing application-level protocols that are designed to work with an unreliable substrate.  REST is the obvious example here: to the extent that one can model a web service using the resource/representation framework and design the interaction pattern so that retrievals are "safe" (having no side effects) and storage operations are "idempotent" (they can be repeated multiple times without changing the state of the resource being updated), an application can simply repeat an operation that fails or whose status is uncertain.
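
A small Python sketch of why idempotent updates make blind retries harmless. Everything here is hypothetical (ResourceStore, put_with_retry, and the `lose_first` knob that simulates lost responses); it models a PUT-style "set the resource to this representation" operation rather than any real HTTP library.

```python
class LostResponse(Exception):
    """Raised when no response came back within the attempt budget."""


class ResourceStore:
    """Hypothetical server holding resources by URI, with PUT-like updates."""

    def __init__(self):
        self.resources = {}
        self.writes = 0

    def put(self, uri, representation):
        self.resources[uri] = representation   # same input, same resulting state
        self.writes += 1

    def get(self, uri):
        return self.resources.get(uri)         # safe: no side effects


def put_with_retry(store, uri, representation, lose_first=0, max_attempts=3):
    """Repeat an idempotent PUT until a response gets back to the client.

    `lose_first` simulates that many responses being lost after the server
    has already applied the update. Returns the number of attempts used.
    """
    for attempt in range(1, max_attempts + 1):
        store.put(uri, representation)          # the request reaches the server
        if attempt > lose_first:
            return attempt                      # the response reached the client
    raise LostResponse(uri)
```

With two lost responses the server applies the same update three times, yet the resource ends up in the same state as after a single successful call. An "append an order" style operation would not have this property.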

[elaboration by a RESTifarian is solicited]

While these are good points to make in the WSA document, there are some downsides that must be noted:

* It's not clear how this would work in an environment where messages span different protocols, which may not have the fine-grained and well-defined error reporting and redirection features of HTTP.

* It requires the web service system to be designed from the ground up to be RESTful; it's hard to see how legacy procedural code can, in general, be wrapped up in a layer that exposes it using the resource/representation framework, makes updates idempotent, etc.

* It forces application developers to think about transport issues much more than they generally want to.  There's a middle ground between hiding the unreliable infrastructure behind "magic" IDEs and protocols and making application developers be aware of all the complex issues involved in distributed computing.  


Reliability and Business Transaction Processing

One can consider the reliability of web service interactions from within a richer taxonomy of reliable messaging features.  For example, David Burdett http://lists.w3.org/Archives/Public/www-ws-arch/2002Dec/0083.html proposes a 6-level taxonomy from "acknowledgement only" through "reliable messaging" (as in ebXML) through "reliable processing" where messaging is linked with business level transaction processing and recovery features.

Peter Furniss elaborates on this basic idea http://lists.w3.org/Archives/Public/www-ws-arch/2002Dec/0179.html by proposing that reliable web service interactions at the business level are best assured using a "two-phase commit" approach: what you as a sender really want to know is if the receiver has committed to performing the work requested by the web service.  If not, various rollback/remediation operations need to be performed,  whether the lack of commitment is due to mechanical reasons or business reasons.  He further argues that "the 'simple' ack approach actually requires some extra messages to avoid one or both sides having to remember the request (or some identification on it) indefinitely or have a complicated set of timeout rules as to when they can forget things. (and that's before we worry about surviving crashes)." These ideas got considerable discussion on the mailing list.  
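
For readers unfamiliar with the pattern, here is a generic two-phase outline in Python. It is a sketch of the general idea only, not of BTP or of the specific proposal in the cited message: Participant and run_two_phase are hypothetical names, and timeouts, persistence, and crash recovery are deliberately left out.

```python
class Participant:
    """Hypothetical receiver that can promise to do work before doing it."""

    def __init__(self, can_do=True):
        self.can_do = can_do      # False models a mechanical or business refusal
        self.state = "idle"

    def prepare(self, request):
        # Phase one: either commit to performing the work or refuse.
        self.state = "prepared" if self.can_do else "refused"
        return self.can_do

    def confirm(self):
        if self.state == "prepared":
            self.state = "confirmed"

    def cancel(self):
        if self.state == "prepared":
            self.state = "cancelled"


def run_two_phase(participants, request):
    """Ask everyone to prepare; confirm only if all agreed, otherwise cancel."""
    votes = [p.prepare(request) for p in participants]
    all_agreed = all(votes)
    for p in participants:
        if all_agreed:
            p.confirm()           # phase two: make the work final
        else:
            p.cancel()            # phase two: undo the promise
    return all_agreed
```

The sender learns from the return value whether every receiver committed to the work, which is a stronger statement than knowing that a message was delivered.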

Conclusion

It would appear that there is a rough consensus: while the higher-level BTP notion of "reliability" certainly has a place in the WSA, and promulgating best practices that stress the importance of safe retrieval and idempotent updates is also a good idea, there is also considerable value in a relatively simple reliable messaging layer. The WG needs to come up with a strategy to ensure that the necessary specs are underway, whether that be within or outside the W3C.
Received on Thursday, 9 January 2003 15:19:38 UTC