Date: 8th June 2004
In any interaction between processes (and by processes I mean something
more general than web services; since processes are a superset, what follows
applies equally to web services), if the scope of observation is the point at
which interaction occurs, then we say that we can observe the external behavior
of that process. For example, suppose a process A sends an order to another
process, B, and B enacts some logic to determine what to do next: it could
determine that the order is for a premium customer, in which case it sends the
order to a process called C, or it may determine that it is for an ordinary
customer, in which case it sends the order to a process called D. All we can
observe is the interaction of passing the order from one process to another.
Thus the valid sequences in the overall grammar that represents the choreography
are (I'm going to use a pseudo pi-calculus to avoid ambiguity):
SYSTEM = A | B | C | D
A = a.A1.x'.A
B = x.(B1.y' + B2.z').B
C = y.C1.C
D = z.D1.D
Where "|" represents the combination in parallel of the
processes that are interacting. Thus A, B, C and D are operating in parallel
(analogous to roles in CDL).
Where "+" represents a choice. Thus B exhibits an observable choice
of either sending on channel y', which C processes, or sending on channel z' for
D to process.
Where "." represents sequence; that is, an order is
received on a, then A does its stuff and sends an order on x', which B
receives on x, and B then sends the order on y', which C receives on y (or of
course sends on z' and D receives on z).
In our example we define A, B, C, D separately (their end point
behavior) to make it simple to see recursive behavior. Thus "A", having
received a message on channel "a", moves into an observable state called "A1"
and sends a message on channel x', after which it behaves like an "A" again. The
other processes are broadly analogous.
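
As a minimal sketch of the grammar above, the observable behavior of a single
order can be listed as sequences of channel names. The channel names a, x, y and
z come from the definitions; the function names and the boolean "premium" flag
are assumptions made purely for illustration.

```python
# Minimal sketch: the observable behavior of one order, as channel sequences.
# Channel names a, x, y, z come from the definitions in the text; the function
# names and the boolean "premium" flag are assumptions for illustration.

def route_order(premium):
    """Return the channels on which an observer sees one order pass."""
    trace = ["a", "x"]                     # A receives on a, B receives on x
    trace.append("y" if premium else "z")  # B's choice: y' to C, or z' to D
    return trace


def valid_traces():
    """Both branches of B's choice, i.e. every valid sequence for one order."""
    return [route_order(True), route_order(False)]


print(valid_traces())  # [['a', 'x', 'y'], ['a', 'x', 'z']]
```

The internal states A1, B1, B2, C1 and D1 do not appear in the traces because
only the message exchanges are observable.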
Now let's move on to error handling. There are two levels that we
need to consider. The first is dealing with exceptional circumstances arising
from a failure in A, B, C or D. The second arises from out-of-bound message
exchanges: messages that no definition in the choreography description is able
to handle in the current context, where a context is a collaboration group.
Failures could occur for a number of reasons. Firstly, a failure
could occur because B decides that the customer submitting the order does not
have a high enough credit rating. Secondly, it could occur because a
communication channel between B and D is broken, in which case B receives a
timeout from somewhere to indicate this failure. Thirdly, it could occur because
C throws an internal exception and passes the exception back in some predefined
format as a valid message exchange.
The classification of failure is thus:
Failed because of a business exception
Failed because of a connection failure (timeout)
Failed because of an end point exception (business exception)
As far as A, B, C and D are concerned, should an error occur at any of
them at any time, it may result in either a different message being sent
(on another channel created for the purpose) or no message being sent. So on
the one hand we have the presence of a message (a business exception) and on
the other we have the absence of a message (a timeout).
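
The distinction between the presence and the absence of a message can be
sketched as a small classifier. This is only an illustration: the channel name
"fault", the Outcome names and the (channel, payload) representation are all
assumptions, since the text says only that an exception travels on another
channel created for the purpose.

```python
from enum import Enum


class Outcome(Enum):
    NORMAL = "normal"
    BUSINESS_EXCEPTION = "business exception"
    TIMEOUT = "timeout"


# Hypothetical channel name; the text only says such a channel is
# "created for the purpose".
EXCEPTION_CHANNELS = frozenset({"fault"})


def classify(observed):
    """Classify an observation: a (channel, payload) pair, or None."""
    if observed is None:                 # absence of a message
        return Outcome.TIMEOUT
    channel, _payload = observed
    if channel in EXCEPTION_CHANNELS:    # presence of an exception message
        return Outcome.BUSINESS_EXCEPTION
    return Outcome.NORMAL


for observation in (("rb", "order accepted"),
                    ("fault", "credit rating too low"),
                    None):
    print(observation, "->", classify(observation).value)
```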
The absence of a message could be viewed as a message from a
timeout; the presence of a message (any business exception message) can be
modeled analogously. To do this properly we need to redefine our SYSTEM to
include all of the business level handshaking that might be required. Thus we
redefine our SYSTEM as follows:
SYSTEM = A | B | C | D
A = a.A1.x'.rb.A
Now A waits on a response from B on a channel called rb.
B = x.(B1.y' + B2.z').(rc + rd).r'b.B
Now B sends a response on r'b, but only after it has
received a response on either rc or rd.
C = y.C1.r'c.C
D = z.D1.r'd.D
C and D are analogous to A.
An approach for dealing with timeouts would be to insert the
necessary choices in observable behavior based on some abstract timer process
that we shall call T.
Now we can rewrite the SYSTEM as follows:
SYSTEM = A | B | C | D | !T1
T = start.CLOCK.stop'
A = a.A1.x'.start'.(rb.A + stop.0)
B = x.(B1.y' + B2.z').start'.((rc + rd).r'b.B + stop.0)
C = y.C1.r'c.C
D = z.D1.r'd.D
Where "!" is the replication operator, which applied to T results
in as many T processes as needed being created.
Where "start" is a message that is received by T that starts a
clock for an appropriate amount of time and then sends a "stop" message to
whoever called the "start". Obviously there is some magic here to deal with
name matching amongst A, B, C and D to ensure that they have private (scoped)
channels to their T.
Where the general default handling of a timeout is to stop (the
"0" term). Because A and B are the only processes in this SYSTEM that receive
responses, they are the only ones that need to model the timeouts.
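
The timer idea can be sketched with Python's asyncio, treating channels as
queues. This is a rough illustration under stated assumptions, not a faithful
pi-calculus encoding: the function names, the delays and the use of None to
stand for the "0" term are all invented here. Each call creates its own stop
queue, which plays the part of the private (scoped) channel to T.

```python
import asyncio


async def timer(delay, stop):
    """T = start.CLOCK.stop': run the clock, then signal stop."""
    await asyncio.sleep(delay)   # CLOCK
    await stop.put("stop")       # stop'


async def await_response(response, delay):
    """Model (rb.A + stop.0): the response or the stop, whichever is first."""
    stop = asyncio.Queue()                        # private channel to this T
    clock = asyncio.create_task(timer(delay, stop))  # start': one copy of !T
    got_response = asyncio.create_task(response.get())
    got_stop = asyncio.create_task(stop.get())
    done, pending = await asyncio.wait(
        {got_response, got_stop}, return_when=asyncio.FIRST_COMPLETED
    )
    # Tidy up whichever branch lost the race, and the clock itself.
    leftovers = list(pending) + [clock]
    for task in leftovers:
        task.cancel()
    await asyncio.gather(*leftovers, return_exceptions=True)
    # None stands for the 0 term: the default handling of a timeout is to stop.
    return got_response.result() if got_response in done else None


async def demo():
    rb = asyncio.Queue()
    await rb.put("order accepted")                     # B responds in time
    answered = await await_response(rb, delay=0.5)
    silent = await await_response(asyncio.Queue(), delay=0.05)  # no response
    return answered, silent


print(asyncio.run(demo()))  # ('order accepted', None)
```

Choosing between the two pending receives with asyncio.wait mirrors the "+"
choice in the definitions of A and B; a real projection would also need to
decide what happens to a response that arrives after the stop.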
The system that we describe manages the business transaction from
A through B to C or D since the passing back from D or C all the way to A has
been modeled.
We could take a different approach and have B react asynchronously
to the business transaction's progress, such that we make the SYSTEM deal with
business exceptions like a cancel initiated through A to B.
WS-CDL has no notion of an individual send and individual receive.
What WS-CDL does is to model the pairing of sends and receives as interacts.
In section 2.4.8.1, entitled "Exception block", it refers to
"Timeout errors, for example an Interaction did not complete within a required
timescale". In the same section it states that "Within a Choreography only one
Exception Work Unit MAY be matched. When an Exception Work Unit matches, it
enables its appropriate activities for recovering from the fault." Therefore,
when a timeout occurs, if an Exception Work Unit matches the fault then that
Work Unit is in effect the pi process that would handle the consequences of the
timeout.
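
The "only one Exception Work Unit MAY be matched" rule can be pictured as a
dispatch that stops at the first match. The sketch below is an assumption-laden
illustration: representing a work unit as a (guard, recover) pair, trying them
in list order, and the fault and recovery strings are all invented here and are
not taken from the WS-CDL text.

```python
def handle_fault(fault, exception_work_units):
    """Run the recovery of the first work unit whose guard matches the fault."""
    for guard, recover in exception_work_units:
        if guard(fault):
            return recover(fault)   # at most one work unit is ever matched
    return None                     # no work unit matched this fault


# Hypothetical work units: one for timeouts, one catch-all.
work_units = [
    (lambda fault: fault == "timeout", lambda fault: "cancel order"),
    (lambda fault: True, lambda fault: "log and stop"),
]

print(handle_fault("timeout", work_units))         # cancel order
print(handle_fault("credit refused", work_units))  # log and stop
```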
In section 2.5.2.4, entitled "Interaction Life-line", it states
that "The time-to-complete timeout identifies the timeframe within which an
Interaction MUST complete. If this timeout occurs, after the Interaction was
initiated but before it completed, then a fault is generated", where an
interaction is defined as:
<interaction name="ncname"
             channelVariable="qname"
             operation="ncname"
             time-to-complete="xsd:duration"?
             align="true"|"false"?
             initiateChoreography="true"|"false"? >
   <participate relationship="qname"
                fromRole="qname"
                toRole="qname" />
   <exchange messageContentType="qname"
             action="request"|"respond" >
      <use variable="XPath-expression"/>
      <populate variable="XPath-expression"/>
   </exchange>*
The addition of a "time-to-complete" attribute is, we would suggest,
the equivalent of the "start" message to a replicated private timer process
T.
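
To feed a timer such as T, an end point would first have to turn the
xsd:duration value of time-to-complete into a delay. The following is a
deliberately restricted sketch: it accepts only the time part of a duration
(hours, minutes, seconds, as in "PT30S"), and the function name is an
assumption. Full xsd:duration also allows years, months, days and a sign, which
are not handled here.

```python
import re

# Restricted subset of xsd:duration: PT[nH][nM][n(.n)S] only.
_TIME_ONLY = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?$")


def time_to_complete_seconds(duration):
    """Convert a time-only xsd:duration string into seconds."""
    match = _TIME_ONLY.match(duration)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {duration!r}")
    hours, minutes, seconds = match.groups()
    return int(hours or 0) * 3600 + int(minutes or 0) * 60 + float(seconds or 0)


print(time_to_complete_seconds("PT30S"))    # 30.0
print(time_to_complete_seconds("PT1M30S"))  # 90.0
```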
Given that WS-CDL is a description language there is quite a lot
of machinery required to project end point behavior that can deal with time.
This will be a consideration for many in attempting to build examples based on
WS-CDL in the future.
Propagation of such faults can be handled by modeling any further interacts
with the various roles within the exception block work unit. As far as we can
tell, no special considerations apply to those work units, and so the full
power of WS-CDL is available. Furthermore, we would suggest that all exceptions
and faults can be propagated in the same way.
We can see no sense in adding anything that makes this more explicit, as it
will complicate the language for very little gain. Modeling propagation as
interacts in an exception block work unit has the advantage that partners can
control, contractually, how they wish to view exceptions and deal with them.
From a semantic perspective there is a need to differentiate between normal
behavior and behavior that deviates from the norm. So a timeout might be seen
as a marked (i.e. well named) channel, or as a marked message sent on such a
channel. How it is seen could depend on the context of a choreography, and may
equally imply a different choreography. The marking is for the purpose of
clarity, where clarity is based on the context in which a choreography is
created (e.g. fixprotocol or SWIFT).
It is also the case that timeouts need to be distinguishable from each
other, less in terms of duration and more in terms of the impact they have;
which path they choose, for example.
The same can be
said of errors.
We would recommend closing the action for this item and any issues directly related to it.
"Bananas: Handling errors and timeouts in a choreography", Monica Martin, Steve Ross-Talbot, 3rd March 2004. http://lists.w3.org/Archives/Public/public-ws-chor/2004Mar/att-0005/Bananas.htm