RE: Reliable Web Services from Newcomer, Eric on 2002-12-12 (www-ws-arch@w3.org from December 2002)

From: Newcomer, Eric <Eric.Newcomer@iona.com>
Date: Thu, 12 Dec 2002 11:43:27 -0500
To: "Cutler, Roger (RogerCutler)" <RogerCutler@ChevronTexaco.com>, <www-ws-arch@w3.org>
Message-ID: <DCF6EF589A22A14F93DFB949FD8C4AB2BA16FA@amereast-ems1.IONAGLOBAL.COM>
Roger,
 
Yes, there is definitely a need for a follow-up interaction to confirm the success or failure of the requested action, and a need to define what happens if A sends a message and never receives an Ack.  You are implying, and I think correctly, that orchestration is the place to define this type of higher-level flow that includes application-level semantics.
 
So if we can assume the simple ack is a reasonable starting place, then within that mechanism I take from your thoughts that we need to add some definition around handling the failure case.  The rest I might suggest putting into the orchestration bucket?
 
Eric

-----Original Message-----
From: Cutler, Roger (RogerCutler) [mailto:RogerCutler@ChevronTexaco.com]
Sent: Wednesday, December 11, 2002 2:43 PM
To: www-ws-arch@w3.org
Subject: Reliable Web Services



Some thoughts about reliable messaging.  This is going to be a bit specific, and I know that I may not have the right perspective to get some of this stuff right, but at least for me thinking along these lines seems helpful in defining the relevant subject space and the 80-20.  Please bear with me if I am going over some grossly familiar territory -- I think I may have some reasonable questions coming out the other end of this discussion.

I think I agree that "just ack" is probably the key to a good start, and maybe a good 80-20.  By "ack" I believe what we are talking about is: if A is sending to B, then A keeps sending the same message (which has an ID) until either it receives an acknowledgement of receipt or some pre-defined timeout or max-tries criterion is met.  B, on the other hand, must be able to handle repeats of the same message.
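
To make the mechanism concrete, here is a minimal Python sketch of the "just ack" behaviour as described above.  It is only an illustration under stated assumptions, not anything taken from a spec: the names Receiver, send_reliably and lossy_channel are made up for this example, the transport is simulated in memory, and only the max-tries criterion is modelled (no wall-clock timeout).

```python
import uuid


class Receiver:
    """B: acts on each message ID at most once, but acks every copy."""

    def __init__(self):
        self.seen = set()
        self.processed = []

    def receive(self, msg_id, payload):
        if msg_id not in self.seen:      # a repeat is acked but not re-processed
            self.seen.add(msg_id)
            self.processed.append(payload)
        return msg_id                    # the ack carries only the message ID


def send_reliably(deliver, payload, max_tries=5):
    """A: resend the same message (same ID) until acked or max_tries is hit.

    deliver(msg_id, payload) is assumed to return the acked ID, or None
    when the message or its ack was lost.
    """
    msg_id = str(uuid.uuid4())
    for attempt in range(1, max_tries + 1):
        if deliver(msg_id, payload) == msg_id:
            return True, attempt
    return False, max_tries


def lossy_channel(receiver, drop_acks):
    """Hypothetical transport: delivers every message, loses the first drop_acks acks."""
    state = {"dropped": 0}

    def deliver(msg_id, payload):
        ack = receiver.receive(msg_id, payload)
        if state["dropped"] < drop_acks:
            state["dropped"] += 1
            return None
        return ack

    return deliver
```

With two lost acks, A succeeds on its third try and B still processes the payload only once, which is the "handle repeats" requirement on B.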

However, there are clear problems that I think people are calling the "two army" problem (why two armies I have been unable to determine).  Basically, I think that this refers to the impossibility, given certain assumptions, of reaching consensus among all parties as to what has happened.  For example (and although I know that there are other possibilities, I think that this is the most likely one), if A is sending to B, B may have received the message and sent an ack (possibly many times), but A never gets any ack.  Under these circumstances A thinks the message has not been delivered, while B thinks it has and acts on it.
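
The disagreement can be shown with a very small simulation.  This is a hedged sketch only: run_exchange and its acks_lost flag are hypothetical names, and the scenario assumes that every copy of the message reaches B while the acks either all arrive or are all lost.

```python
def run_exchange(acks_lost, max_tries=3):
    """Return (a_believes_delivered, b_acted) for a single message."""
    b_seen = set()
    b_acted = False
    a_got_ack = False
    for _ in range(max_tries):
        # In this scenario the message itself always reaches B.
        if "msg-1" not in b_seen:
            b_seen.add("msg-1")
            b_acted = True           # B acts on the first copy it sees
        # B sends an ack for every copy; it may never reach A.
        if not acks_lost:
            a_got_ack = True
            break
    return a_got_ack, b_acted


print(run_exchange(acks_lost=False))  # (True, True): both sides agree
print(run_exchange(acks_lost=True))   # (False, True): A gives up, B has acted
```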

First, does this really matter?  Unfortunately, I think that it does.  I believe it is relatively easy to devise a number of scenarios where this kind of thing can happen -- some of them involving malicious actions or fraud and others just involving bad equipment, preparation or luck.  I think it would be a good idea to articulate some of these scenarios and analyze the likelihood and consequences.  (The lack of this kind of discussion is, I believe, a significant weakness in the ebXML reliable messaging spec.)

Would the situation be changed materially if the spec were changed so that A, at the time of "giving up", sent a "last message" to B saying, stated informally, "I've been trying to send you a message with ID xxx and I have not gotten an ack.  I'm giving up now.  If in fact you got the message, be warned that I don't know it.  Here is some contact information in case you want to try to explore this situation further"?  I believe that this extension would address some of the failure scenarios but not others.
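
As a rough sketch of what such an extension might look like (an assumption for illustration, not part of any existing spec), the sender below emits a best-effort notice once its retries are exhausted.  GiveUpNotice, send_with_give_up and the deliver and send_notice callbacks are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class GiveUpNotice:
    """The informal "last message": which ID, how hard A tried, whom to contact."""
    msg_id: str
    tries: int
    contact: str


def send_with_give_up(deliver, send_notice, msg_id, payload, contact, max_tries=3):
    """Retry msg_id up to max_tries; on failure send a best-effort notice.

    deliver(msg_id, payload) is assumed to return True only when an ack
    came back.  Returns None on success, or the notice that was sent.
    """
    for _ in range(max_tries):
        if deliver(msg_id, payload):
            return None
    notice = GiveUpNotice(msg_id=msg_id, tries=max_tries, contact=contact)
    send_notice(notice)   # not retried or acked: the notice itself can be lost
    return notice
```

Because the notice travels over the same unreliable path, it covers the case where B's acks were lost but A's messages get through, and not the case where nothing from A reaches B any longer.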

Are there other simple additions, or alternative strategies, that would further narrow the failure space?  I have another idea, but it is more complex than I wish to discuss at the moment -- and I'm not sure how reasonable it is anyway.  I think that it involves going outside the framework of the assumptions that yield the two army problem in the first place.

It seems to me likely that if such a scenario analysis were pursued, one would probably find a high-value subset of the problem to address.  I think personally that there is little to be gained by making the "ack" mechanism too elaborate or trying to inject a lot more sophistication into it.  This is because I am guessing, on the basis of current business practice (in EDI, for example), what an analysis of the "malicious actions and fraud" scenarios would conclude.  I think that the way these things are REALLY handled in business is essentially to split the transaction up into a bunch of choreographed pieces.  For example, if A sends B a purchase order, the "ack" from B just says, "I got it", not "I understand and can handle this".  There is then a separate confirmation message sent from B to A saying "Yes, this is a PO I understand, I've got the merchandise, your terms are acceptable, and so on".  The ack from A back to B is, again, just an "I got it", not an "I agree".

I think that the effect of choreographing the interaction in this way is essentially to make it much more reliable and controllable by making it proceed in baby steps over a period of time.  Once it is started, if any of the steps does not happen as expected, this, in itself, raises an error condition independent of the messaging issues.
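
One way to picture this is a small tracker that knows which business message should come next and flags the exchange when a step is late.  The sketch below is an assumption-laden illustration: the Choreography class, the step names and the single per-step timeout are all hypothetical, and times are plain numbers supplied by the caller rather than read from a clock.

```python
class Choreography:
    """Tracks an ordered list of expected business messages."""

    def __init__(self, steps, step_timeout):
        self.steps = list(steps)
        self.step_timeout = step_timeout
        self.next_index = 0
        self.last_progress = 0.0

    def start(self, now):
        self.last_progress = now

    def on_business_message(self, kind, now):
        """Record a step; an out-of-order or extra step is itself an error."""
        if self.next_index >= len(self.steps) or kind != self.steps[self.next_index]:
            raise ValueError(f"unexpected step: {kind}")
        self.next_index += 1
        self.last_progress = now

    def is_overdue(self, now):
        """True when the next expected step has not arrived in time."""
        done = self.next_index >= len(self.steps)
        return (not done) and (now - self.last_progress > self.step_timeout)


flow = Choreography(["purchase_order", "po_confirmation"], step_timeout=60)
flow.start(now=0)
flow.on_business_message("purchase_order", now=1)
print(flow.is_overdue(now=100))  # True: the confirmation never arrived
```

The error here is raised by the missing confirmation, whatever the individual acks did or did not report.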

I also think that this approach indicates a fairly strict limitation to the semantics (if I dare use the word) contained in the ack.

What I am getting out of this is that the flawed reliable messaging solution (e.g. the ebXML spec) is probably "good enough" for the purpose, possibly with some minor elaboration, but I think that it is important that the flaws and the possible remediation strategies be explored and clearly documented.
Received on Thursday, 12 December 2002 11:44:32 UTC