Reliable Web Services

From: Cutler, Roger (RogerCutler)
Date: Wed, 11 Dec 2002
Message-ID: <7FCB5A9F010AAE419A79A54B44F3718E01817C3A@bocnte2k3.boc.chevrontexaco.net>
To: www-ws-arch@w3.org
Some thoughts about reliable messaging.  This is going to be a bit
specific, and I know that I may not have the right perspective to get
some of this stuff right, but at least for me thinking along these lines
seems helpful in defining the relevant subject space and the 80-20.
Please bear with me if I am going over some grossly familiar territory
-- I think I may have some reasonable questions coming out the other end
of this discussion.

I think I agree that "just ack" is probably the key to a good start, and
maybe a good 80-20.  By "ack" I believe what we are talking about is: if
A is sending to B, then A keeps sending the same message (which has an
ID) until either receiving an acknowledgement of receipt or some
pre-defined timeout or max-tries criterion is met.  B, on the other
hand, must be able to handle repeats of the same message.
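
To make the "just ack" idea concrete, here is a minimal sketch, not taken from any spec. The names `Receiver`, `send_reliably`, and `make_lossy_channel` are hypothetical, and the timeout criterion is left out in favor of a simple max-tries count.

```python
from dataclasses import dataclass, field


@dataclass
class Receiver:
    """B: acts on each message ID at most once, but acks every copy."""
    seen: set = field(default_factory=set)
    processed: list = field(default_factory=list)

    def receive(self, msg_id, body):
        if msg_id not in self.seen:   # first copy: act on it
            self.seen.add(msg_id)
            self.processed.append(body)
        return msg_id                 # ack repeats too; an earlier ack may have been lost


def send_reliably(msg_id, body, transmit, max_tries=5):
    """A: resend the same message until its ack arrives or max_tries is reached.

    transmit(msg_id, body) returns the acked ID, or None if nothing came back.
    """
    for _ in range(max_tries):
        if transmit(msg_id, body) == msg_id:
            return True
    return False


def make_lossy_channel(receiver, drop_first):
    """Hypothetical channel that loses the first `drop_first` transmissions."""
    attempts = {"n": 0}

    def transmit(msg_id, body):
        attempts["n"] += 1
        if attempts["n"] <= drop_first:
            return None               # message never reached B
        return receiver.receive(msg_id, body)

    return transmit


b = Receiver()
delivered = send_reliably("po-1", "purchase order", make_lossy_channel(b, drop_first=2))
print(delivered, b.processed)  # True ['purchase order']
```

The duplicate check on the receiving side is what lets A resend freely: a repeat is acked again but not acted on twice.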

However, there are clear problems that I think people are calling the
"two army" problem (why two armies I have been unable to determine).
Basically, I think that this refers to the impossibility, given certain
assumptions, of reaching consensus among all parties as to what has
happened.  For example (and although I know that there are other
possibilities, I think that this is the most likely one), if A is
sending to B, B may have received the message and sent an ack (possibly
many times), but A never gets any ack.  Under these circumstances A
thinks the message has not been delivered, B thinks it has and acts on
it.

First, does this really matter?  Unfortunately, I think that it does.
I believe it is relatively easy to devise a number of scenarios where
this kind of thing can happen -- some of them involving malicious
actions or fraud and others just involving bad equipment, preparation or
luck.  I think it would be a good idea to articulate some of these
scenarios and analyze the likelihood and consequences.  (The lack of
this kind of discussion is, I believe, a significant weakness in the
ebXML reliable messaging spec).

Would the situation be changed materially if the spec were changed so
that A, at the time of "giving up", sent a "last message" to B saying,
stated informally, "I've been trying to send you a message with ID xxx
and I have not gotten an ack.  I'm giving up now.  If in fact you got
the message, be warned that I don't know it.  Here is some contact
information in case you want to try to explore this situation further"?
I believe that this extension would address some of the failure
scenarios but not others.
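
A rough sketch of the "last message" idea follows, under stated assumptions. The names `Receiver`, `send_with_give_up`, and the `ack_arrives` / `notice_arrives` switches are all hypothetical stand-ins for the network, and the message is assumed always to reach B so that only the ack and the notice can be lost.

```python
from dataclasses import dataclass, field


@dataclass
class Receiver:
    """B: records what it acted on and any give-up notices it receives."""
    processed: set = field(default_factory=set)
    disputed: dict = field(default_factory=dict)

    def receive(self, msg_id):
        self.processed.add(msg_id)
        return msg_id  # the ack

    def receive_give_up(self, msg_id, contact):
        # Only IDs that B acted on are in doubt; otherwise both sides
        # already agree that nothing happened.
        if msg_id in self.processed:
            self.disputed[msg_id] = contact


def send_with_give_up(msg_id, receiver, ack_arrives, notice_arrives,
                      contact, max_tries=3):
    """A: retry, then send a best-effort 'last message' on giving up."""
    for _ in range(max_tries):
        ack = receiver.receive(msg_id)     # the message itself gets through
        if ack_arrives and ack == msg_id:  # ...but the ack may not
            return "acked"
    if notice_arrives:                     # the notice can be lost as well
        receiver.receive_give_up(msg_id, contact)
    return "gave_up"


b = Receiver()
print(send_with_give_up("po-7", b, ack_arrives=False, notice_arrives=True,
                        contact="ops@a.example"))  # gave_up
print(b.disputed)  # {'po-7': 'ops@a.example'}
```

With `notice_arrives=False` the call still returns "gave_up" but `b.disputed` stays empty while `b.processed` contains the ID. That is one of the failure scenarios the extension does not address.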

Are there other simple additions, or alternative strategies, that would
further narrow the failure space?  I have another idea, but it is more
complex than I wish to discuss at the moment -- and I'm not sure how
reasonable it is anyway.  I think that it involves going outside the
framework of the assumptions that yield the two army problem in the
first place.

It seems to me likely that if such a scenario analysis were pursued,
one would probably find a high-value subset of the problem to address.
I think personally that there is little to be gained by making the "ack"
mechanism too elaborate or trying to inject a lot more sophistication
into it.  This is because I am guessing, on the basis of current
business practice (in EDI, for example), what an analysis of the
"malicious actions and fraud" scenarios would show.  I think that the way
these things are REALLY handled in business is essentially to split the
transaction up into a bunch of choreographed pieces.  For example, if A
sends B a purchase order, the "ack" from B just says, "I got it", not "I
understand and can handle this".  There is then a separate confirmation
message sent from B to A saying "Yes, this is a PO I understand, I've
got the merchandise, your terms are acceptable, and so on".  The ack
from A back to B is, again, just an "I got it", not an "I agree".

I think that the effect of choreographing the interaction in this way is
essentially to make it much more reliable and controllable by making it
proceed in baby steps over a period of time.  Once it is started, if any
of the steps does not happen as expected, this in itself raises an
error condition independent of the messaging issues.
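
The baby-steps idea can be sketched as a check on the expected sequence of business events. This is only an illustration: the step names in `EXPECTED_STEPS`, the `ChoreographyError` class, and the single `step_timeout` parameter are assumptions, not part of any spec or EDI practice I can cite.

```python
# Hypothetical step names for the purchase-order exchange described above.
EXPECTED_STEPS = ["po_sent", "po_acked", "confirmation_sent", "confirmation_acked"]


class ChoreographyError(Exception):
    """Raised when the exchange departs from the expected sequence."""


def check_choreography(events, now, step_timeout):
    """Check the (step_name, timestamp) events observed so far, in order.

    Returns the next expected step, or None if the exchange is complete.
    Raises ChoreographyError if a step is out of order or the next one is overdue.
    """
    last_time = None
    for i, (name, ts) in enumerate(events):
        if i >= len(EXPECTED_STEPS) or name != EXPECTED_STEPS[i]:
            raise ChoreographyError(f"unexpected step {name!r} at position {i}")
        last_time = ts
    if len(events) == len(EXPECTED_STEPS):
        return None
    next_step = EXPECTED_STEPS[len(events)]
    if last_time is not None and now - last_time > step_timeout:
        raise ChoreographyError(f"step {next_step!r} is overdue")
    return next_step


events = [("po_sent", 0), ("po_acked", 5)]
print(check_choreography(events, now=10, step_timeout=60))  # confirmation_sent
try:
    check_choreography(events, now=100, step_timeout=60)
except ChoreographyError as err:
    print("error:", err)  # error: step 'confirmation_sent' is overdue
```

Note that the overdue check fires whether or not any individual ack was lost, which is the sense in which the error condition is independent of the messaging layer.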

I also think that this approach indicates a fairly strict limitation to
the semantics (if I dare use the word) contained in the ack.

What I am getting out of this is that the flawed reliable messaging
solution (e.g. the ebXML spec) is probably "good enough" for the
purpose, possibly with some minor elaboration, but I think that it is
important that the flaws and the possible remediation strategies be
explored and clearly documented.
