Bananas: Handling errors and timeouts in a choreography

Authors: Monica Martin, Sun Microsystems Inc.

Steve Ross-Talbot, Enigmatec Corporation Ltd.

Date: 3rd March 2004

Definitions

In any interaction between processes (by processes I mean something more generic than web services, but since processes are a superset, what follows applies equally to web services), if the scope of observation is the point at which interaction takes place, then we say that we can observe the external behaviour of that process. For example, suppose a process A sends an order to another process, B, and B enacts some logic to determine what to do next: it could determine that the order is for a premium customer, in which case it sends the order to a process called C, or it may determine that it is for an ordinary customer, in which case it sends the order to a process called D. All we can observe is the interaction of passing the order from one process to another. Thus the valid sequences in the overall grammar that represents the choreography are (I'm going to use a pseudo pi-calculus to avoid ambiguity):

a.A.x' | x.B.y' | y.C

a.A.x' | x.B.z' | z.D

Where "|" represents the parallel composition of the processes that are interacting.

Where an unprimed name such as x represents the receipt of a message and x' the sending of a message. In this case x and x', and y and y', represent typed channels over which directional communication takes place.

Where "." represents sequence; that is, an order is received on a, then A does its stuff and sends an order on x', which B receives on x; B then sends the order on y', which C receives on y (or, of course, sends it on z' and D receives it on z).

We can label these as:

PremiumCustomer ::= a.A.x' | x.B.y' | y.C

OrdinaryCustomer ::= a.A.x' | x.B.z' | z.D

Error handling and timeouts

Now let's move on to error handling. There are two levels that we need to consider. The first is dealing with exceptional circumstances arising from a failure in A, B, C or D. The second arises from out of bounds message exchanges; these are messages for which there is no definition in the choreography description able to handle them in the current context, where a context is a collaboration group.

As far as A, B, C and D are concerned, should an error occur at any of them at any time, it may result in either a different message being sent (on another channel created for the purpose) or no message being sent. So on the one hand we have the presence of a message and on the other we have the absence of a message.

The presence of a message can be modeled analogously to the normal exchanges above. The absence of a message could be viewed as a message from a timeout. For example, we could change the definition of PremiumCustomer as follows:

PremiumCustomer ::= (a.A.x' + t.A) | (x.B.y' + t.B.t') | y.C

The "+" operator is a choice.

In this example, if B fails then it receives a timeout on t and then sends a timeout on t', which A receives. C is unaware and continues to wait for an order on y, which is okay because it hasn't progressed at all at this stage. The net result is that an alternative path is taken in order to deal with the observable message that resulted from the timeout.
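The choice in B's part of the definition, (x.B.y' + t.B.t'), can be sketched with ordinary queues. This is only a minimal sketch under stated assumptions: the queues stand in for the channels, the expiry of a local clock stands in for B's receive on t, and the names run_b, t_out and wait_seconds are invented for the illustration.

```python
import queue


def run_b(x, y, t_out, wait_seconds=0.05):
    """B as the choice (x.B.y' + t.B.t'): forward the order or report a timeout."""
    try:
        order = x.get(timeout=wait_seconds)  # x: an order arrives in time
    except queue.Empty:
        t_out.put("timeout")                 # t': the absence becomes a message
        return "timeout"
    y.put(order)                             # y': pass the order on to C
    return "forwarded"


x, y, t_out = queue.Queue(), queue.Queue(), queue.Queue()
x.put("order-1")
first = run_b(x, y, t_out)   # an order is waiting, so the normal path is taken
second = run_b(x, y, t_out)  # nothing arrives, so the timeout path is taken
print(first, second)         # forwarded timeout
```

In the second call nothing is ever put on y, so a process reading from y simply keeps waiting, while a process reading from t_out observes the timeout as an ordinary message.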

The other way of dealing with this is for process B, at least in this case, to send a different message (not a timeout message), which requires C and possibly A to take action of some sort. The only thing that needs to be modeled is the observable interaction that ensues from the point of view of the receivers A and C.

The OrdinaryCustomer is analogous.

Handling out of bounds messages in a choreography is really no more than a parsing error across the allowable messages at any point. If such an out of bounds message occurs then the choreography need do nothing about it. Why is this? Simply because the runtime environment (which is vendor specific) would be required to do something about it. This depends on the role that the runtime environment plays and so is ultimately down to the vendor to specify. If there is an issue, it is how to propagate such an error to more than one participant process. The technical solutions for modeling this are much less of a problem than the political issue of a participant agreeing to allow observers (in this case the other participant processes) to observe that it has generated an out of bounds message. So the solution lies in providing the choreography designer with the necessary tools to model it in either a global or a local way. In the former, the other participants may need to be informed, in which case this could be viewed as an alternative path; in the latter, it is a local matter to be resolved by the owner of the process.
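The "parsing error" view can be made concrete with a small check of an observed sequence of channel names against the sequences the choreography allows. This is a sketch only: the table ALLOWED and the function first_out_of_bounds are hypothetical, and what a runtime does once it finds such a message is deliberately left out, since that is the vendor's concern.

```python
# Hypothetical table of the channel sequences each choreography allows.
ALLOWED = {
    "PremiumCustomer": ["a", "x", "y"],
    "OrdinaryCustomer": ["a", "x", "z"],
}


def first_out_of_bounds(observed):
    """Return the index of the first message no choreography allows, or None."""
    candidates = list(ALLOWED.values())
    for i, channel in enumerate(observed):
        # Keep only the choreographies still consistent with what has been seen.
        candidates = [seq for seq in candidates if i < len(seq) and seq[i] == channel]
        if not candidates:
            return i  # the "parse" fails here: this message is out of bounds
    return None


print(first_out_of_bounds(["a", "x", "y"]))  # None
print(first_out_of_bounds(["a", "x", "q"]))  # 2
```

A valid prefix such as ["a", "x"] also yields None, because nothing observed so far contradicts either choreography.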

Some important technical considerations

From a technical point of view, modeling time in a consistent manner is easy to say and much harder to do. However, the processes that a timeout relates to are generally in one location, and so their notion of a timeout, and in particular of when to start the clock, is from their own perspective.

The other issue is how to model the sends/receives as interacts. Since an interact collapses the notion of send/receive into one conceptual entity, we need to be able to express the same timeout and error handling with an interact. This is really a question for the spec editors.

Propagation we shall leave for group discussion as part and parcel of the wider debate.

Some important business semantic considerations

From a semantic perspective there is a need to differentiate between normal behaviour and behaviour that deviates from the norm. So a timeout might be seen as a marked (i.e. well named) channel, or as a marked message sent on such a channel. How it is seen could depend on the context of a choreography and may equally imply a different choreography. The marking is for the purpose of clarity, where clarity is based on the context in which a choreography is created (e.g. FIX Protocol or SWIFT). It is also the case that timeouts need to be distinguishable from each other, less in terms of duration and more in terms of the impact they have: which path they choose, for example.

The same can be said of errors.