RE: Reliable Messaging - Summary of Threads from Assaf Arkin on 2002-12-14 (www-ws-arch@w3.org from December 2002)

From: Assaf Arkin <arkin@intalio.com>
Date: Sat, 14 Dec 2002 13:59:53 -0800
To: "Cutler, Roger (RogerCutler)" <RogerCutler@ChevronTexaco.com>, <www-ws-arch@w3.org>
Message-ID: <IGEJLEPAJBPHKACOOKHNGEOICOAA.arkin@intalio.com>

A phone is a perfect example for synchronous communication since
we're talking to each other at the same time.

Just yesterday I had a phone conversation cut short before I got to complete
it. In fact, since it was raining hard, some of what I said got garbled
(that's an annoying side effect of the communication frequency used by cell
phones), so messages were lost before I got disconnected and of course after
I got disconnected.

So a cell phone by itself is not a reliable medium. In fact, it's actually
an asynchronous medium, even though it's duplex (meaning messages are going
in and out on the same line). In fact, the first layer of TCP gives you just
that, the duplexity. Synchronous communication is built on top of a duplex
protocol.

Even without paying attention, we have established very simple rules for
making a conversation synchronous over an inherently unreliable medium.
Unless we both say 'bye' or 'talk to you later' at the end of the call, we
assume the call got disconnected. And if I say 'bye' but don't hear you say
'bye', I assume it got disconnected, so I would ask you to confirm, or call
you back to make sure you heard everything so far, except the 'bye'.

This is a fairly trivial case, but what happened yesterday, before I got
disconnected, I was saying a few things that did not make it to the other
end. So the other person noticed a silence on the line and asked me 'can you
hear me?'. If I wasn't saying anything, I would have just replied, 'yes I
can, I was just silent for a few seconds'. If I was saying something, I
would reply 'yes, what was the last thing you heard?'. If they tell me that
they heard ABC, but I also followed that with XYZ, I would repeat XYZ.

So the way in which we conduct the conversation allows us to make a
synchronous communication over a not-so-reliable medium. We have in fact
established a common protocol that most people follow in much the same way,
that is quite similar to how TCP works.
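
To make the comparison concrete, here is a minimal Python sketch of the
'what was the last thing you heard?' rule. The names Listener, talk and
recover are made up for illustration, the set of dropped positions stands in
for the rain, and the listener is assumed to ignore anything that arrives
after a gap. Real TCP uses sequence numbers, windows and timers, none of
which are modelled here.

```python
class Listener:
    """The person on the other end of the call (hypothetical model)."""

    def __init__(self):
        self.heard = []

    def hear(self, index, word):
        # Accept a word only if it is the next one in order; a word that
        # arrives after a gap is ignored, since the listener cannot place it.
        if index == len(self.heard):
            self.heard.append(word)

    def last_heard(self):
        # The answer to "what was the last thing you heard?", given as a
        # count of words received in order.
        return len(self.heard)


def talk(words, listener, dropped):
    """Say every word once; positions listed in `dropped` never arrive."""
    for index, word in enumerate(words):
        if index not in dropped:
            listener.hear(index, word)


def recover(words, listener):
    """Ask how far the listener got, then repeat everything after that."""
    start = listener.last_heard()
    for index in range(start, len(words)):
        listener.hear(index, words[index])
    return words[start:]


words = ["A", "B", "C", "X", "Y", "Z"]
other_end = Listener()
talk(words, other_end, dropped={3, 4, 5})   # the line goes bad after "C"
repeated = recover(words, other_end)        # only "X", "Y", "Z" are repeated
```

The speaker never has to repeat ABC, only the part the other side reports
missing, which is the same idea as resending from the last acknowledged
position.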

And just as you pointed out, the more chatty we are, the more reliable the
conversation becomes.

In fact, distributed protocols are often based on the notion that a two-way
message exchange involving a sequence of n message pairs has validated that
message pairs 1 through n-1 were indeed sent and received. When message pair
n+1 gets exchanged it validates that message pairs 1 through n were indeed
sent and received.
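
A small Python sketch of that counting argument, seen from the side that
sends the replies. Responder and run_exchange are hypothetical names, and the
assumption is that each request reports the number of the last reply the
requester saw, so the newest pair is always the only one still in doubt.

```python
class Responder:
    """The side that answers each request (hypothetical model)."""

    def __init__(self):
        self.replied = 0     # number of the last pair we replied to
        self.validated = 0   # highest pair we know was fully exchanged

    def handle(self, pair_no, last_reply_seen):
        # The request for pair n reports the last reply the requester saw.
        # That tells us every pair up to that number was sent and received.
        self.validated = max(self.validated, last_reply_seen)
        self.replied = pair_no
        return pair_no       # the reply simply echoes the pair number


def run_exchange(responder, n):
    """Exchange n message pairs over a line that loses nothing."""
    last_reply_seen = 0
    for pair_no in range(1, n + 1):
        last_reply_seen = responder.handle(pair_no, last_reply_seen)
    return last_reply_seen


responder = Responder()
run_exchange(responder, 5)
# Five pairs exchanged: the responder has replied to pair 5 but can only
# be sure about pairs 1 through 4 until the next request arrives.
```

Calling responder.handle(6, 5) afterwards moves the validated mark to 5,
which is the n+1 step described above.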

And this can be implemented at multiple levels at the same time. So TCP
would do that for the packets, HTTP 1.1 (with keep-alive connections) would
do that for HTTP messages over TCP, a stateful service would use a
choreography to do that over HTTP, and so forth.

arkin


  I must confess that the whole issue of synchronous, asynchronous and
message loss now has me thoroughly confused.  For example, suppose I am
talking to you on the phone.  I think most people would consider that
synchronous communication.  But have you ever had the experience of having
the other side drop out silently -- you talk and talk and finally realize
that you are talking to yourself -- for an unknown length of time?

  I sort of gather, though I don't know the details, that TCP/IP handles
this by being chatty.  And the more I think about it, the more I think that
actually guaranteeing message delivery can ONLY be done by establishing a
fabric of conversation that validates the reception in little baby steps.

  -----Original Message-----
  From: Assaf Arkin [mailto:arkin@intalio.com]
  Sent: Friday, December 13, 2002 10:04 PM
  To: Cutler, Roger (RogerCutler); www-ws-arch@w3.org
  Subject: RE: Reliable Messaging - Summary of Threads


  The two army problem is concerned with the possibility of message loss.
Message loss could occur when you are using an asynchronous transport
protocol, though in most literature the term would be medium, where protocol
is a more generic term that would even cover a choreography.

  Although you can have an asynchronous API for performing an operation,
that API is between you and a messaging engine and typically you would use
in-process calls or some synchronous transport, so there's no possibility of
message loss. You can tell without a doubt whether the messaging engine is
going to send the message or not.

  Even if the operation you are doing is asynchronous, you can use a
synchronous protocol such as HTTP POST to deliver the message in which case
there is no possibility for message loss. But you can also use an
asynchronous protocol such as SMTP or UDP, in which case the message could
be lost on the way to its destination. Lost has a loose definition: a message
that gets garbled, delayed or routed to the wrong place is considered lost.

  Addressing message loss is therefore a problem of the protocol you use and
not the operation you perform. So in my opinion that is outside the scope of
WSDL abstract operation definition, but in the scope of specific protocol
bindings, and it would definitely help if the protocol layer (XMLP) could
address that relieving us of the need to define ack operations.

  arkin
    -----Original Message-----
    From: www-ws-arch-request@w3.org [mailto:www-ws-arch-request@w3.org]On
Behalf Of Cutler, Roger (RogerCutler)
    Sent: Friday, December 13, 2002 1:28 PM
    To: Assaf Arkin; www-ws-arch@w3.org
    Subject: RE: Reliable Messaging - Summary of Threads


    Thanks for the support.

    One thing this note reminded me of -- I have seen a number of different
definitions of "synchronous" floating around this group.  In fact, if my
memory serves, there are three major ones.  One concentrates on the idea
that a call "blocks" if it is synchronous, another has a complicated logic
that I cannot recall and the third (contained in one of the references on
the two army problem) concentrates on the length of time it takes for a
message to arrive.  The formality of all of these definitions indicates to
me that all have had considerable thought put into them and that all are, in
their context, "correct".  They are, however, also different.

    -----Original Message-----
    From: Assaf Arkin [mailto:arkin@intalio.com]
    Sent: Friday, December 13, 2002 2:27 PM
    To: Cutler, Roger (RogerCutler); www-ws-arch@w3.org
    Subject: RE: Reliable Messaging - Summary of Threads



      3 - There is concern about the "two army" problem, which essentially
says that it is not possible, given certain assumptions about the types of
interactions, for all parties in the communication to reliably reach
consensus about what has happened.  I have been trying to encourage the
objective of documenting the scenarios that can come up and their
relative importance and possibly mitigating factors or strategies.  I
haven't seen people violently disagreeing but I wouldn't call this a
consensus point of view.  I consider the ebXML spec as weak in discussing
the two-army problem.

      The two army problem assumes you are using a non-reliable medium for
all your communication and proves that it is impossible for the sender to
reach confidence that the message has arrived and is processed in 100% of
cases.

      You can increase your level of confidence by using message + ack and
being able to resend a message and receive a duplicate ack. That gets you
close to 100% but not quite there, but it means that in most cases the
efficient solution (using asynchronous messaging) would work, and so
presents a viable option.
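
A minimal Python sketch of message + ack with resend and duplicate acks.
Everything here is an assumption made for illustration: the names Receiver
and send_reliably, the independent loss rate applied separately to the
message and to the ack, and the fixed retry limit. It is not the behaviour of
any particular protocol.

```python
import random


class Receiver:
    """Processes each message id once and acks every copy it sees."""

    def __init__(self):
        self.seen = set()
        self.processed = []

    def deliver(self, msg_id, payload):
        # A resend caused by a lost ack must produce a duplicate ack,
        # not duplicate work, so only unseen ids are processed.
        if msg_id not in self.seen:
            self.seen.add(msg_id)
            self.processed.append(payload)
        return msg_id


def send_reliably(receiver, msg_id, payload, loss_rate, max_attempts, rng):
    """Return the attempt on which an ack arrived, or None if we gave up."""
    for attempt in range(1, max_attempts + 1):
        if rng.random() < loss_rate:
            continue             # message lost on the way out, resend
        ack = receiver.deliver(msg_id, payload)
        if rng.random() < loss_rate:
            continue             # ack lost on the way back, resend
        if ack == msg_id:
            return attempt
    return None                  # still no ack: sender cannot tell what happened


receiver = Receiver()
result = send_reliably(receiver, 1, "hello", loss_rate=0.2,
                       max_attempts=5, rng=random.Random())
```

Under these assumptions, with loss probability p in each direction a single
attempt succeeds with probability (1-p)^2, so k attempts all fail with
probability (1-(1-p)^2)^k. That shrinks quickly as k grows but never reaches
zero, and when the function returns None the sender still does not know
whether the message was processed.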

      In my opinion it is sufficient for a low level protocol to give you
that level of reliability. And that capability is generic enough that we
would want to address it at the protocol level in a consistent manner, so we
reduce at least one level of complexity for the service developer. It is
also supported by a variety of transport protocols and mediums.

      This still doesn't mean you can get two distributed services to
properly communicate with each other in all cases. A problem arises if
either the message was not received (and is not processed), a message was
received but no ack received (and is processed) or a message was received
and an ack was received but the message is still not processed.

      That problem is not unique to asynchronous messaging, in fact it also
presents itself when synchronous messaging is used. With synchronous
messaging you have 100% confidence that a message was received, but no
confidence that it will be processed. Furthermore, you may fail before you
are able to persist that information, in which case your confidence is lost.

      If you do not depend on the result of the message being processed then
you would simply regard each message that is sent as being potentially
processed. You use the ack/resend mechanism as a way to increase the
probability that the message indeed reaches its destination, so a majority
of your messages will be received and processed.

      I argue that using ack/resend you could reach the same level of
confidence that the message will be processed as if you were using a
synchronous protocol, but could do so more efficiently.

      If you do depend on the message being processed, then you are in a
different class of problem, and simply having a reliable protocol is not
sufficient since it does not address the possibility that the message was
received, acked but not processed. It in fact presents the same problem that
would arise when synchronous protocols are used.

      This is best solved at a higher layer. There are two possible
solutions, both of which are based on the need to reach a consensus between
two systems. One solution is based on a two-phase commit protocol, which
could be extended to use asynchronous patterns. A more efficient solution in
terms of message passing would be to use state transitions that coordinate
through the exchange of well defined messages. This could be modeled using a
choreography language.
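
For readers unfamiliar with the first option, a bare-bones Python sketch of
the two-phase commit idea follows. Participant and two_phase_commit are
hypothetical names, everything runs in one process, and the sketch
deliberately ignores coordinator failure and message loss between the
phases, which is where the real difficulty lies.

```python
class Participant:
    """One of the two systems (hypothetical model, no network or log)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        # Phase one: vote. A real participant would persist its vote first.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Return True if every participant committed, False if all aborted."""
    # Phase one: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase two: commit only on a unanimous yes, otherwise abort everyone.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False


sender = Participant("sender")
processor = Participant("processor")
agreed = two_phase_commit([sender, processor])   # both end up "committed"
```

The point of the sketch is only that both systems end in the same state;
it costs two round trips per participant, which is the message overhead the
state-transition approach tries to avoid.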

      Since this is outside the scope of this discussion I will not go into
details, but if anyone is interested I would recommend looking at protocols
for handling failures in distributed systems (in particular Paxos). In my
understanding these protocols are applicable for modeling at the
choreography language and are more efficient than using transactional
protocols and two-phase commit.

      My only point here was to highlight that a solution involving
ack/resend is sufficient to give you the same level of confidence that a
message would be processed as if you were using a synchronous operation, and
that solutions for achieving 100% confidence are required whether you are
using asynchronous or synchronous messaging.

      This is in support of Roger's recommendation for adding ack support to
XMLP.

       regards,
       arkin
Received on Saturday, 14 December 2002 17:00:56 UTC