Proposed text on reliability in the web services architecture [long]

[I have an action item to update the WSA draft with text "harvested" from
the lengthy threads last month on reliability in the WSA and what we should
say and do about it.  I'm not confident enough in my understanding and
summary to plop this right into the document, so I offer it here for your
consideration and critique.]

 
The issue of how one ensures that Web services operate reliably in
environments (such as the Internet) that do not offer "industrial strength"
quality of service is a multifaceted and contentious one.  Indeed, it's a
densely interconnected cluster of issues: one might address it in the
messaging infrastructure, with retry and timeout logic that makes repeated
attempts to move SOAP messages from the ultimate origin to their final
destination and reports the status back to the sender; "asynchrony" is
part of the picture because, without a reliable substrate, senders and
receivers may communicate "out of band" to ensure that messages were
received; and it's tangled up with Choreography and Transactions because
Web service invocations can fail for all sorts of reasons besides messages
not being delivered, and some sort of recovery protocol may need to be
scripted.  Finally, there is a school of thought that argues that reliable
messaging is not an important feature of the messaging infrastructure if the
application-level protocol is well-designed.

Let's consider the architectural issues as independently as possible. 

Reliable Messaging Layer

First, what can be done to assure that SOAP messages have been received
(possibly after traversing multiple intermediaries) once and only once?
[Note: it is important to pay very close attention to the wording of
assertions about reliable messaging: there is no way to "assure" that a
message will be received, and it appears to be beyond the state of the art
in computer science to reliably determine that a message was NOT
received, so it is misleading to say that a reliable messaging layer lets
the sender determine "whether" a message was received!]  Of course, one
possibility is to use proprietary messaging software at the transport layer;
examples would include products such as IBM WebSphere MQ and MQ Everywhere,
Microsoft MSMQ, and those from a number of vendors that implement the
Java Message Service API.

Another approach, on which this discussion will focus, is taken by the ebXML
Messaging Service specification and a number of vendor-specific
technologies: use SOAP header extensions to define a protocol for handling
retries, removal of duplicates, and notification of errors.

For example, in ebXML:
"Reliability is achieved by a Receiving MSH responding to a message with an
Acknowledgment Message. An Acknowledgment Message is any ebXML message
containing an Acknowledgment element. Failure to receive an Acknowledgment
Message by a Sending MSH MAY trigger successive retries until such time as
an Acknowledgment Message is received or the predetermined number of retries
has been exceeded at which time the From Party MUST be notified of the
probable delivery failure. Whenever an identical message may be received
more than once, some method of duplicate detection and elimination is
indicated, usually through the mechanism of a persistent store." 
http://www.oasis-open.org/committees/ebxml-msg/documents/ebMS_v2_0rev_c.pdf
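
To make the quoted scheme concrete, here is a minimal Python sketch of the
ack/retry/duplicate-elimination loop.  It is not taken from the ebXML spec:
the names ReceivingMSH, send_reliably and lossy are made up for
illustration, an in-memory set stands in for the persistent store, and the
"transport" is just a function call that sometimes loses the ack.

```python
class ReceivingMSH:
    """Hypothetical receiver: acks every message, delivers each id only once."""

    def __init__(self):
        self.seen_ids = set()   # stand-in for the persistent store
        self.delivered = []     # payloads handed to the application

    def receive(self, message_id, payload):
        if message_id not in self.seen_ids:
            self.seen_ids.add(message_id)
            self.delivered.append(payload)
        # Duplicates are acknowledged again but not redelivered.
        return "ack"


def send_reliably(message_id, payload, transport, max_retries=3):
    """Return True once an ack arrives; False means 'probable delivery failure'."""
    for _ in range(1 + max_retries):      # first attempt plus the retries
        if transport(message_id, payload) == "ack":
            return True
    return False                          # caller should notify the From Party


def lossy(receiver, drop_acks):
    """Hypothetical transport that loses the first `drop_acks` acknowledgments."""
    state = {"remaining": drop_acks}

    def transport(message_id, payload):
        ack = receiver.receive(message_id, payload)
        if state["remaining"] > 0:
            state["remaining"] -= 1
            return None                   # message arrived, but the ack was lost
        return ack

    return transport
```

With two lost acks, send_reliably("m1", "order", lossy(receiver, 2)) returns
True and the receiver delivers "order" exactly once.  If every ack is lost,
the sender reports a probable failure even though the message was in fact
delivered, which is why the wording in the note above matters.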

It is worth noting that the problem of reliable messaging is very
successfully addressed on the Internet itself in the TCP protocol, which
sits on top of IP to ensure that message components are delivered once and
only once in the correct order.  We may wish to look at the approach adopted
within TCP:

* TCP uses the assumption that all messages prior to the explicitly ack'ed
message are implicitly ack'ed.

* TCP employs exponential back-off: unacknowledged messages are
retransmitted with an exponentially growing time interval between attempts.

* TCP uses a separation between message sending and acknowledgment, with a
single packet being used to carry two levels of the conversation at once.
This is quite different from the simple ack of ebXML: you tend to send as
many acks as messages you send (as opposed to messages you receive).
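
The first two points can be sketched in a few lines of Python.  This is a
simplification for discussion, not TCP itself: it numbers whole messages
rather than bytes, and the function names, the initial interval and the
doubling factor are my own assumptions.

```python
def backoff_intervals(initial, retries, factor=2.0):
    """Waiting time before each retransmission, growing exponentially."""
    return [initial * factor ** attempt for attempt in range(retries)]


def apply_cumulative_ack(unacked, acked_seq):
    """An ack for `acked_seq` implicitly acks every earlier sequence number."""
    return [seq for seq in unacked if seq > acked_seq]
```

For example, backoff_intervals(1.0, 4) gives waits of 1, 2, 4 and 8 seconds,
and apply_cumulative_ack([1, 2, 3, 4, 5], 3) leaves only messages 4 and 5
outstanding.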

 
But in the case of reliable messaging, it seems that you should be able to
use, for example, SOAP over HTTPR on one hop, and SOAP over JMS on the next
hop, and still be able to support reliable messaging end-to-end. (The
message goes reliably from A to C iff it goes reliably from A to B and from
B to C - for example, B waits until it gets the transport-level ack from C
before sending its transport-level ack to A). 
In fact, I think this was the rationale when IBM designed HTTPR, so that you
could go from Internet to intranet (and vice versa), using SOAP over HTTPR
on the Internet, and then switching to SOAP over MQSeries (or other MOM)
once inside the intranet. 
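
A toy Python sketch of that hop-by-hop rule, where B acks A only after C has
acked B.  Plain function calls stand in for the real transports (HTTPR, JMS,
MQSeries), and make_endpoint / make_intermediary are hypothetical names.

```python
def make_endpoint(log):
    """Final destination C: record the message, then return a transport-level ack."""
    def deliver(message):
        log.append(message)
        return True
    return deliver


def make_intermediary(next_hop):
    """Intermediary B: ack upstream only once the downstream hop has acked."""
    def deliver(message):
        downstream_acked = next_hop(message)   # e.g. SOAP over JMS to C
        return downstream_acked                # this becomes B's ack to A
    return deliver
```

If the B-to-C hop never acks, A never sees an ack either, so A's view of the
A-to-C delivery is only as good as the weakest hop.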

So, there needs to be a "box" in the Web services architecture
document/diagrams to describe components [oops, we aren't supposed to use
that word!  I forget what we are supposed to say ...] that manage message
delivery, handling retries and error reporting as well as can be done.
It's not clear to me, however, whether a W3C specification of this is
needed, whether the ebXML Messaging reliability layer could be referenced
normatively, or what.  Thoughts?

RESTful Application-Level Protocols in lieu of RM

To some extent, the absence of a reliable messaging layer in the
infrastructure can be alleviated or superseded by employing
application-level protocols that are designed to work with an unreliable
substrate.  REST is the obvious example here: to the extent that one can
model a web service using the resource/representation framework, and design
the interaction pattern so that retrievals are "safe" (having no side
effects) and storage operations are "idempotent" (repeating them leaves the
resource in the same state as performing them once), an application can
simply repeat an operation that fails or whose status is uncertain.
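
A small Python illustration of why idempotence makes blind retries
harmless.  The Resource class and its put/add operations are hypothetical
and not tied to any particular HTTP library.

```python
class Resource:
    """Hypothetical resource with one idempotent and one non-idempotent update."""

    def __init__(self):
        self.state = {}
        self.balance = 0

    def put(self, key, value):
        # Idempotent: repeating the call leaves the same state as calling it once.
        self.state[key] = value

    def add(self, amount):
        # Not idempotent: every repetition changes the state again.
        self.balance += amount


def send_twice(operation, *args):
    """Simulate a sender that repeats a request whose status was uncertain."""
    operation(*args)
    operation(*args)
```

After send_twice(r.put, "qty", 5) the state is {"qty": 5}, exactly as if the
request had been sent once; after send_twice(r.add, 5) the balance is 10, so
the retry has silently doubled the update.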

[elaboration by a RESTifarian is solicited]

While these are good points to make in the WSA document, there are some
downsides that must be noted:

* It's not clear how this would work in an environment where messages span
different protocols, which may not have the fine-grained and well-defined
error reporting and redirection features of HTTP.

* It requires the web service system to be designed from the ground up to be
RESTful; it's hard to see how legacy procedural code can, in general, be
wrapped up in a layer that exposes it using the resource/representation
framework, makes updates idempotent, etc.

* It forces application developers to think about transport issues much more
than they generally want to.  There's a middle ground between hiding the
unreliable infrastructure behind "magic" IDEs and protocols and making
application developers aware of all the complex issues involved in
distributed computing.


Reliability and Business Transaction Processing

One can consider the reliability of web service interactions from within a
richer taxonomy of reliable messaging features.  For example, David Burdett
http://lists.w3.org/Archives/Public/www-ws-arch/2002Dec/0083.html proposes a
6-level taxonomy from "acknowledgement only" through "reliable messaging"
(as in ebXML) through "reliable processing" where messaging is linked with
business level transaction processing and recovery features.

Peter Furniss elaborates on this basic idea
http://lists.w3.org/Archives/Public/www-ws-arch/2002Dec/0179.html by
proposing that reliable web service interactions at the business level are
best assured using a "two-phase commit" approach: what you as a sender
really want to know is if the receiver has committed to performing the work
requested by the web service.  If not, various rollback/remediation
operations need to be performed,  whether the lack of commitment is due to
mechanical reasons or business reasons.  He further argues that "the
'simple' ack approach actually requires some extra messages to avoid one or
both sides having to remember the request (or some identification on it)
indefinitely or have a complicated set of timeout rules as to when they can
forget things. (and that's before we worry about surviving crashes)."
These ideas got considerable discussion on the mailing list.  
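
For readers who haven't met the pattern, a generic two-phase commit can be
sketched in Python as below.  This is the textbook shape only, not Peter's
proposal or the BTP protocol; the Participant class and its will_commit flag
are invented for illustration, and timeouts and crash recovery are left out.

```python
class Participant:
    """Hypothetical participant that may or may not commit to the work."""

    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit
        self.state = "idle"

    def prepare(self):
        self.state = "prepared" if self.will_commit else "refused"
        return self.will_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled back"


def two_phase_commit(participants):
    """Phase 1: collect votes.  Phase 2: commit only if every vote was yes."""
    votes = [p.prepare() for p in participants]
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()    # a refusal may be mechanical or a business decision
    return False
```

The sender learns one thing from the return value: whether every receiver
committed to the work, which is the question the paragraph above says it
really cares about.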

Conclusion

It would appear that there is a rough consensus that, while the higher-level
BTP notion of "reliability" certainly has a place in the WSA and
promulgating best practices that stress the importance of safe retrieval and
idempotent updates is also a good idea, there is considerable value in
a relatively simple reliable messaging layer.  The WG needs to come up with a
strategy to ensure that the necessary specs are underway, whether that be
within or outside the W3C.
 

Received on Wednesday, 8 January 2003 19:55:50 UTC