Yes we have some bananas.

Modeling errors, timeouts and their propagation in a WS-CDL

Authors: Steve Ross-Talbot, Enigmatec Corporation Ltd.

Date: 8th June 2004

Definitions

In any interaction between processes (by processes I mean something more general than web services; since processes are a superset, what follows applies equally to web services), if the observable scope of a process is the point at which interaction is observed, then we say that we can observe the external behavior of that process. For example, suppose a process A sends an order to another process, B, and B enacts some logic to determine what to do next. It could determine that the order is for a premium customer, in which case it sends the order to a process called C, or that it is for an ordinary customer, in which case it sends the order to a process called D. All we can observe is the interaction of passing the order from one process to another. Thus the valid sequences in the overall grammar that represents the choreography are as follows (I'm going to use a pseudo pi-calculus to avoid ambiguity):

SYSTEM = A | B | C | D

A = a.A1.x’.A

B = x.(B1.y’ + B2.z’).B

C = y.C1.C

D = z.D1.D

Where "|" represents the parallel composition of the processes that are interacting. Thus A, B, C and D operate in parallel (analogous to roles in CDL).

Where “+” represents a choice. B exhibits an observable choice of either sending on channel y’, which C processes, or sending on channel z’ for D to process.

Where "." represents sequence. That is, an order is received on a, then A does its stuff and sends an order on x', which B receives on x; B then sends the order on y', which C receives on y (or, of course, sends it on z', which D receives on z).

In our example we define A, B, C, D separately (their end point behavior) to make it simple to see recursive behavior. Thus “A”, having received a message on channel “a” moves into an observable state called “A1” and sends a message on channel x’ after which it behaves like an “A” again. The other processes are broadly analogous.
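To make the claim about valid sequences concrete, here is a minimal Python sketch that enumerates the observable traces of one round of the SYSTEM above. It is an illustration only, not part of WS-CDL or of any pi-calculus tool. The encoding is our own assumption: a trailing ASCII apostrophe marks a send, capitalised names are internal states, channel "a" is fed by the environment, and B's choice is fixed up front, which is adequate for listing traces but says nothing about when the choice is made.

```python
from itertools import product

# Hypothetical encoding of the four processes. Each process has one or more
# alternatives; an alternative is a sequence of actions. "x'" is a send on x,
# "x" is a receive on x, and capitalised names are internal (unobservable) states.
PROCESSES = {
    "A": [["a", "A1", "x'"]],
    "B": [["x", "B1", "y'"], ["x", "B2", "z'"]],
    "C": [["y", "C1"]],
    "D": [["z", "D1"]],
}
ENVIRONMENT_INPUTS = {"a"}  # assumption: orders arrive on "a" from outside the SYSTEM


def is_internal(action):
    return action[0].isupper()


def is_send(action):
    return action.endswith("'")


def next_observable(remaining):
    """Perform one observable step and return its channel, or None if stuck."""
    for steps in remaining.values():
        while steps and is_internal(steps[0]):
            steps.pop(0)  # internal states are skipped silently
    for name, steps in remaining.items():
        if not steps:
            continue
        head = steps[0]
        if head in ENVIRONMENT_INPUTS:
            steps.pop(0)
            return head
        if is_send(head):
            channel = head[:-1]
            # A send only happens if another process is ready to receive on it.
            for other, other_steps in remaining.items():
                if other != name and other_steps and other_steps[0] == channel:
                    steps.pop(0)
                    other_steps.pop(0)
                    return channel
    return None


def observable_trace(chosen):
    """Run one round of the SYSTEM for a given alternative per process."""
    remaining = {name: list(PROCESSES[name][i]) for name, i in chosen.items()}
    trace = []
    while True:
        channel = next_observable(remaining)
        if channel is None:
            return trace
        trace.append(channel)


def all_traces():
    names = list(PROCESSES)
    picks = product(*(range(len(PROCESSES[n])) for n in names))
    return sorted({tuple(observable_trace(dict(zip(names, p)))) for p in picks})


print(all_traces())
```

Under these assumptions the only traces found are a, x, y and a, x, z, which matches the premium and ordinary customer paths described above.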

Error handling and timeouts

Now let's move on to error handling. There are two levels that we need to consider. The first deals with exceptional circumstances arising from a failure in A, B, C or D. The second arises from out of bound message exchanges: messages for which no definition in the choreography description can handle them in the current context, where a context is a collaboration group.

Failures could occur for a number of reasons. Firstly, a failure could occur because B decides that the customer submitting the order does not have a high enough credit rating. Secondly, it could occur because the communication channel between B and D is broken, in which case B receives a timeout from somewhere to indicate the failure. Thirdly, it could occur because C throws an internal exception and passes the exception back in some predefined format as a valid message exchange.

The classification of failure is thus:

Failed because of a business exception

Failed because of a connection failure (timeout)

Failed because of an end point exception (business exception)

As far as A, B, C and D are concerned, should an error occur at any of them at any time, it may result either in a different message being sent (on another channel created for the purpose) or in no message being sent. So on the one hand we have the presence of a message (a business exception) and on the other we have the absence of a message (a timeout).

The absence of a message could be viewed as a message from a timeout; the presence of a message (any business exception message) can be modeled analogously. To do this properly we need to redefine our SYSTEM to include all of the business level handshaking that might be required. Thus we redefine our SYSTEM as follows:

SYSTEM = A | B | C | D

A = a.A1.x’.r_b.A

Now A waits on a response from B on a channel called r_b

B = x.(B1.y’ + B2.z’).(r_c + r_d).r’_b.B

Now B sends a response on r’_b, but only after it has received a response on either r_c or r_d.

C = y.C1.r’_c.C

D = z.D1.r’_d.D

C and D are analogous to A.

An approach for dealing with timeouts would be to insert the necessary choices in observable behavior based on some abstract timer process that we shall call T.

Now we can rewrite the SYSTEM as follows:

SYSTEM = A | B | C | D | !T

T = start.CLOCK.stop’

A = a.A1.x’.start’.(r_b.A + stop.0)

B = x.(B1.y’ + B2.z’).start’.((r_c + r_d).r’_b.B + stop.0)

C = y.C1.r’_c.C

D = z.D1.r’_d.D

Where “!” is the replication operator, which when applied to T creates as many T processes as are needed.

Where “start” is a message that is received by T that starts a clock for an appropriate amount of time and then sends a “stop” message to whoever called the “start”. Obviously there is some magic here to deal with name matching amongst A, B, C and D to ensure that they have private (scoped) channels to their T.

Where the general default handling of a timeout is to stop (the “0” term). Because A and B are the only processes in this SYSTEM that receive responses they are the only ones that need to model the timeouts.
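The timer construction can be sketched in Python with asyncio, as an illustration only. The sketch assumes that each queue stands for a channel, that A creates private start and stop queues for its own copy of T, and that a single stand-in coroutine plays B together with whatever sits behind it. All function names and delays are hypothetical.

```python
import asyncio


async def timer(start, stop, clock):
    """T = start.CLOCK.stop': wait for start, let the clock run, then send stop."""
    await start.get()
    await asyncio.sleep(clock)
    await stop.put("stop")


async def response_or_stop(response, stop):
    """The choice (r_b.A + stop.0): whichever channel delivers first wins."""
    got_response = asyncio.ensure_future(response.get())
    got_stop = asyncio.ensure_future(stop.get())
    done, pending = await asyncio.wait(
        {got_response, got_stop}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # the branch not taken is discarded
    return "response" if got_response in done else "timeout"


async def process_a(x, r_b, clock):
    # Private (scoped) channels between A and its own timer.
    start, stop = asyncio.Queue(), asyncio.Queue()
    timer_task = asyncio.create_task(timer(start, stop, clock))
    await x.put("order")      # x'
    await start.put("start")  # start'
    outcome = await response_or_stop(r_b, stop)
    timer_task.cancel()
    return outcome


async def process_b(x, r_b, delay):
    """Stand-in for B (and C or D behind it): reply on r_b after `delay` seconds."""
    await x.get()
    await asyncio.sleep(delay)
    await r_b.put("response")


async def run_system(b_delay, clock):
    x, r_b = asyncio.Queue(), asyncio.Queue()
    b_task = asyncio.create_task(process_b(x, r_b, b_delay))
    outcome = await process_a(x, r_b, clock)
    b_task.cancel()
    return outcome


print(asyncio.run(run_system(b_delay=0.01, clock=0.2)))  # B answers inside the clock
print(asyncio.run(run_system(b_delay=0.2, clock=0.01)))  # the clock expires first
```

With these delays the first run should take the r_b branch and the second the stop branch. Creating the queues inside process_a is one way of providing the name scoping that the text calls magic.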

The system that we describe manages the business transaction from A through B to C or D since the passing back from D or C all the way to A has been modeled.

We could take a different approach and have B react asynchronously to the business transaction's progress, so that the SYSTEM can deal with business exceptions such as a cancel initiated through A to B.

How does this, or should this, apply to WS-CDL?

WS-CDL has no notion of an individual send and an individual receive. Instead, WS-CDL models the pairing of sends and receives as interacts.

In section 2.4.8.1, entitled “Exception block”, the specification mentions “Timeout errors, for example an Interaction did not complete within a required timescale”. The same section states that “Within a Choreography only one Exception Work Unit MAY be matched. When an Exception Work Unit matches, it enables its appropriate activities for recovering from the fault.” Therefore, when a timeout occurs, if an Exception Work Unit matches the fault then that Work Unit is in effect the pi process that handles the consequences of the timeout.

In section 2.5.2.4, entitled “Interaction Life-line”, it states that “The time-to-complete timeout identifies the timeframe within which an Interaction MUST complete. If this timeout occurs, after the Interaction was initiated but before it completed, then a fault is generated” where an interaction is defined as:

    <interaction name="ncname"
                 channelVariable="qname"
                 operation="ncname"
                 time-to-complete="xsd:duration"?
                 align="true"|"false"?
                 initiateChoreography="true"|"false"? >
        <participate relationship="qname"
                     fromRole="qname"
                     toRole="qname" />
        <exchange messageContentType="qname"
                  action="request"|"respond" >
        </exchange>*
    </interaction>

The “time-to-complete” attribute is, we would suggest, the equivalent of the “start” message to a replicated private timer process T.

Given that WS-CDL is a description language, quite a lot of machinery is required to project end point behavior that can deal with time. This will be a consideration for many who attempt to build examples based on WS-CDL in the future.
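As a rough illustration of that machinery, the following Python sketch turns a time-to-complete value into a deadline on an interaction and generates a fault when the deadline passes. It is not taken from WS-CDL or any implementation. The duration parser deliberately handles only the time part of xsd:duration (hours, minutes, seconds), and TimeoutFault, perform_interaction and slow_reply are hypothetical names introduced for the example.

```python
import asyncio
import re

# Simplified xsd:duration: only forms such as PT30S or PT1M30S are accepted here.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?$")


def duration_to_seconds(text):
    match = _DURATION.match(text)
    if match is None or not any(match.groups()):
        raise ValueError(f"unsupported duration: {text!r}")
    hours, minutes, seconds = (float(g) if g else 0.0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


class TimeoutFault(Exception):
    """Hypothetical fault for an interaction that exceeds its time-to-complete."""


async def perform_interaction(exchange, time_to_complete):
    """Await `exchange` under the deadline; the deadline plays the role of start'."""
    try:
        return await asyncio.wait_for(
            exchange, timeout=duration_to_seconds(time_to_complete)
        )
    except asyncio.TimeoutError:
        raise TimeoutFault(time_to_complete) from None


async def slow_reply():
    await asyncio.sleep(0.2)  # the other role answers too late
    return "response"


async def demo():
    try:
        return await perform_interaction(slow_reply(), "PT0.01S")
    except TimeoutFault as fault:
        return f"fault after {fault}"


print(asyncio.run(demo()))
```

Here asyncio.wait_for stands in for the pair of start and stop messages exchanged with T, so the private timer process never appears explicitly.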

Propagation

Propagation of such faults can be handled by modeling any further interacts with the various roles within the exception block work unit. As far as we can tell, no special considerations apply to those work units, and so the full power of WS-CDL is available. Furthermore, we would suggest that all exceptions and faults can be propagated in the same way.

We can see no sense in adding anything that makes this more explicit, as it would complicate the language for very little gain. Modeling propagation as interacts in an exception block work unit has the advantage of letting partners control, contractually, how they wish to view exceptions and deal with them.
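A small Python sketch may help to show what propagation by ordinary interacts amounts to. Everything in it is an assumption made for illustration: the ExceptionWorkUnit class, the guard functions, the message names, and the rule that the first matching work unit in list order is the one enabled.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Interaction = Tuple[str, str, str]  # (fromRole, toRole, message), all hypothetical


@dataclass
class ExceptionWorkUnit:
    name: str
    matches: Callable[[str], bool]   # guard over the fault name
    interactions: List[Interaction]  # ordinary interacts used to propagate the fault


def handle_fault(fault, work_units):
    """Enable one matching work unit only; here, the first in list order."""
    for unit in work_units:
        if unit.matches(fault):
            return list(unit.interactions)
    return []  # no work unit matched, so nothing is propagated in this sketch


work_units = [
    ExceptionWorkUnit(
        "onTimeout",
        lambda fault: fault == "timeout",
        [("B", "A", "orderTimedOut"), ("B", "C", "cancelOrder")],
    ),
    ExceptionWorkUnit(
        "onAnyFault",
        lambda fault: True,
        [("B", "A", "orderFailed")],
    ),
]

for sender, receiver, message in handle_fault("timeout", work_units):
    print(f"{sender} -> {receiver}: {message}")
```

The point of the sketch is that the handler contains nothing special: it is a list of the same kind of interacts used everywhere else, so partners can agree on exactly which roles hear about which faults.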

Some important business semantic considerations

From a semantic perspective there is a need to differentiate between normal behavior and behavior that deviates from the norm. So a timeout might be seen as a marked (i.e. well named) channel, or as a marked message sent on such a channel. How it is seen could depend on the context of a choreography and may equally imply a different choreography. The marking is there for clarity, where clarity is judged in the context in which the choreography is created (e.g. FIX protocol or SWIFT). Timeouts also need to be distinguishable from each other, less in terms of duration and more in terms of the impact they have: which path they choose, for example.

The same can be said of errors.

Recommendation

We would recommend closing the action for this item and any issues directly related to it.

References

“Bananas: Handling errors and timeouts in a choreography”, Monica Martin and Steve Ross-Talbot, 3rd March 2004. http://lists.w3.org/Archives/Public/public-ws-chor/2004Mar/att-0005/Bananas.htm