Re: Reliable Web Services from Francis McCabe on 2002-12-11 (www-ws-arch@w3.org from December 2002)

From: Francis McCabe <fgm@fla.fujitsu.com>
Date: Wed, 11 Dec 2002 13:41:56 -0800
To: "Cutler, Roger (RogerCutler)" <RogerCutler@ChevronTexaco.com>
Cc: www-ws-arch@w3.org
Message-Id: <5F0D26CC-0D51-11D7-994E-000393A3327C@fla.fujitsu.com>
I think that we could do worse than look at the approach adopted within 
TCP, as this has some implications for choreography too.

In that approach (sorry if I am teaching grandmothers) each message 
includes an ack of a particular message, the assumption being that all 
messages prior to the explicitly ack'ed message are implicitly ack'ed 
too. Occasionally, an empty ack message is sent to keep the channel 
alive.
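
To make the implicit-ack rule concrete, here is a minimal sketch. It 
assumes messages carry whole-message sequence numbers starting at 1 
(TCP itself numbers bytes, not messages), and the function names are 
hypothetical, chosen only for illustration.

```python
def cumulative_ack(received_seqs):
    """Return the highest n such that messages 1..n have all arrived (0 if none).

    Assumption: sequence numbers are per message and start at 1.
    """
    received = set(received_seqs)
    ack = 0
    while ack + 1 in received:
        ack += 1
    return ack


def still_outstanding(sent_seqs, ack):
    """Messages the sender must still treat as unacknowledged after seeing `ack`."""
    return [seq for seq in sent_seqs if seq > ack]
```

With messages 1, 2, 3 and 5 received, the ack carried back is 3: it 
covers 1 and 2 implicitly, and says nothing about 5 until 4 arrives.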

Also built into TCP is exponential back-off: unacknowledged messages 
are retransmitted with an exponentially growing time interval between 
attempts.
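
A rough sketch of that retransmission pattern follows. The initial 
interval, growth factor and retry limit are illustrative assumptions, 
not TCP's actual timer values, and `send` and `wait` are hypothetical 
callables supplied by the caller.

```python
def retry_intervals(initial=1.0, factor=2.0, max_tries=5):
    """Wait time after each attempt: initial, initial*factor, initial*factor**2, ..."""
    return [initial * factor ** i for i in range(max_tries)]


def send_until_acked(send, intervals, wait=lambda seconds: None):
    """Call send() until it returns True (acked) or the intervals run out.

    Returns the number of attempts made, or None if no ack ever arrived.
    `wait` is injected so the sketch can run without really sleeping;
    a real sender would pass time.sleep here.
    """
    for attempt, delay in enumerate(intervals, start=1):
        if send():
            return attempt
        wait(delay)
    return None
```

Returning None corresponds to the sender giving up after its max-tries 
limit, which is exactly the point where sender and receiver may 
disagree about what happened.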

This is choreography-level message reliability, and therein lie some 
issues (e.g., not all choreographies are between 2 parties).

Of the essence here is a separation between message sending and 
acknowledgment, with a single packet being used to carry two levels of 
the conversation at once. This is quite different from the simple ack 
proposal. Although it doesn't itself solve the two army problem, it is 
pretty efficient, as you tend to send as many acks as messages you send 
(as opposed to messages you receive).
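
As a sketch of what carrying both levels in one packet might look 
like between two parties: the `Endpoint` class, its field names and 
the dictionary message layout below are all hypothetical, and 
per-message sequence numbers starting at 1 are assumed.

```python
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    next_seq: int = 1                           # sequence number for our next payload
    received: set = field(default_factory=set)  # sequence numbers seen from the peer
    peer_acked: int = 0                         # highest of our messages the peer has ack'ed

    def send(self, payload=None):
        """Build an outgoing message; payload=None gives an empty, ack-only message."""
        msg = {"seq": None, "payload": payload, "ack": self._cumulative_ack()}
        if payload is not None:
            msg["seq"] = self.next_seq
            self.next_seq += 1
        return msg

    def receive(self, msg):
        """Record an incoming message and the ack piggybacked on it."""
        if msg["seq"] is not None:
            self.received.add(msg["seq"])  # a set, so duplicates are harmless
        self.peer_acked = max(self.peer_acked, msg["ack"])

    def _cumulative_ack(self):
        """Highest n such that peer messages 1..n have all arrived."""
        ack = 0
        while ack + 1 in self.received:
            ack += 1
        return ack
```

Every message an endpoint sends carries an ack, so the number of acks 
tracks the number of messages sent rather than received; a dedicated 
ack-only message is needed only when there is nothing else to say.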

On Wednesday, December 11, 2002, at 11:42  AM, Cutler, Roger 
(RogerCutler) wrote:

> Some thoughts about reliable messaging.  This is going to be a bit 
> specific, and I know that I may not have the right perspective to get 
> some of this stuff right, but at least for me thinking along these 
> lines seems helpful in defining the relevant subject space and the 
> 80-20.  Please bear with me if I am going over some grossly familiar 
> territory -- I think I may have some reasonable questions coming out 
> the other end of this discussion.
>
> I think I agree that "just ack" is probably the key to a good start, 
> and maybe a good 80-20.  By "ack" I believe what we are talking about 
> is: if A is sending to B, then A keeps sending the same message (which 
> has an ID) until either receiving an acknowledgement of receipt or 
> some pre-defined timeout or max-tries criterion is met.  B, on the 
> other hand, must be able to handle repeats of the same message.
>
> However, there are clear problems that I think people are calling the 
> "two army" problem (why two armies I have been unable to determine).  
> Basically, I think that this refers to the impossibility, given 
> certain assumptions, of reaching consensus among all parties as to what 
> has happened.  For example (and although I know that there are other 
> possibilities, I think that this is the most likely one), if A is 
> sending to B, B may have received the message and sent an ack 
> (possibly many times), but A never gets any ack.  Under these 
> circumstances A thinks the message has not been delivered, B thinks it 
> has and acts on it.
>
> First, does this really matter?  Unfortunately, I think that it 
> does.  I believe it is relatively easy to devise a number of scenarios 
> where this kind of thing can happen -- some of them involving 
> malicious actions or fraud and others just involving bad equipment, 
> preparation or luck.  I think it would be a good idea to articulate 
> some of these scenarios and analyze the likelihood and consequences.  
> (The lack of this kind of discussion is, I believe, a significant 
> weakness in the ebXML reliable messaging spec).
>
> Would the situation be changed materially if the spec were changed so 
> that A, at the time of "giving up", sent a "last message" to B saying, 
> stated informally, "I've been trying to send you a message with ID xxx 
> and I have not gotten an ack.  I'm giving up now.  If in fact you got 
> the message, be warned that I don't know it.  Here is some contact 
> information in case you want to try to explore this situation 
> further"?  I believe that this extension would address some of the 
> failure scenarios but not others.
>
> Are there other simple additions, or alternative strategies, that 
> would further narrow the failure space?  I have another idea, but it 
> is more complex than I wish to discuss at the moment -- and I'm not 
> sure how reasonable it is anyway.  I think that it involves going 
> outside the framework of the assumptions that yield the two army 
> problem in the first place.
>
> It seems to me likely that if such a scenario analysis were pursued, 
> one would probably find a high-value subset of the problem to 
> address.  I think personally that there is little to be gained by 
> making the "ack" mechanism too elaborate or trying to inject a lot 
> more sophistication into it.  This is because I am guessing, on the 
> basis of current business practice (in EDI, for example), what an 
> analysis of the "malicious actions and fraud" scenarios would show.  I 
> think that the way these things are REALLY handled in business is 
> essentially to split the transaction up into a bunch of choreographed 
> pieces.  For example, if A sends B a purchase order, the "ack" from B 
> just says, "I got it", not "I understand and can handle this".  There 
> is then a separate confirmation message sent from B to A saying "Yes, 
> this is a PO I understand, I've got the merchandise, your terms are 
> acceptable, and so on".  The ack from A back to B is, again, just an 
> "I got it", not an "I agree".
>
> I think that the effect of choreographing the interaction in this way 
> is essentially to make it much more reliable and controllable by 
> making it proceed in baby steps over a period of time.  Once it is 
> started, if any of the steps does not happen as expected, this in 
> itself raises an error condition independent of the messaging issues.
>
> I also think that this approach indicates a fairly strict limitation 
> to the semantics (if I dare use the word) contained in the ack.
>
> What I am getting out of this is that the flawed reliable messaging 
> solution (e.g. the ebXML spec) is probably "good enough" for the 
> purpose, possibly with some minor elaboration, but I think that it is 
> important that the flaws and the possible remediation strategies be 
> explored and clearly documented.
>
Received on Wednesday, 11 December 2002 16:42:16 UTC