RE: Different Levels of Reliable Messaging from Burdett, David on 2002-12-14 (www-ws-arch@w3.org from December 2002)

From: Burdett, David <david.burdett@commerceone.com>
Date: Sat, 14 Dec 2002 11:17:53 -0800
To: "'Ricky Ho'" <riho@cisco.com>, "Burdett, David" <david.burdett@commerceone.com>, "Burdett, David" <david.burdett@commerceone.com>, www-ws-arch@w3.org
Message-ID: <C1E0143CD365A445A4417083BF6F42CC053D152E@C1plenaexm07.commerceone.com>
Ricky

See comments in line below marked with <DB2></DB2>

David

-----Original Message-----
From: Ricky Ho [mailto:riho@cisco.com]
Sent: Friday, December 13, 2002 6:13 PM
To: Burdett, David; Burdett, David; www-ws-arch@w3.org
Subject: RE: Different Levels of Reliable Messaging


Thanks David, see my followup questions (embedded)


>The "ack" doesn't need to be per-message based.  I can send an ack for a
>bunch of messages (of course, a sequence number is used).
><DB>Agreed, but now you are adding in an extra level of complexity (sequence
>number) that often won't be needed. What I would suggest is that you split
>this into another two levels:
>1. Sequencing Support. This is a protocol, built on top of reliable
>messaging that ensures that messages arrive in the sequence they were sent.
>2. Reduced Frequency Acknowledgement Messages. You could then vary the
>reliable messaging protocol so that a request for an acknowledgement is
>every so many messages and if it is not received, then corrective action is
>taken.
></DB>
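
The two splits suggested above can be sketched roughly as follows. This is only a minimal Python illustration, not any specified protocol: the class name `CumulativeAckReceiver`, the `ack_every` parameter, and the choice to silently ignore duplicates are all assumptions made for the example.

```python
class CumulativeAckReceiver:
    """Hypothetical receiver: restores send order, acks every `ack_every` messages."""

    def __init__(self, ack_every):
        self.ack_every = ack_every
        self.next_expected = 1   # lowest sequence number not yet delivered
        self.buffered = {}       # out-of-order messages held back
        self.delivered = []      # payloads released to the application, in send order
        self.since_last_ack = 0  # messages delivered since the last ack was issued

    def receive(self, seq, payload):
        """Accept one message; return a cumulative ack number, or None."""
        if seq < self.next_expected or seq in self.buffered:
            return None  # duplicate: ignored in this sketch
        self.buffered[seq] = payload
        # Sequencing support: release only a contiguous run of messages.
        while self.next_expected in self.buffered:
            self.delivered.append(self.buffered.pop(self.next_expected))
            self.next_expected += 1
            self.since_last_ack += 1
        # Reduced-frequency acks: one ack covers everything delivered so far.
        if self.since_last_ack >= self.ack_every:
            self.since_last_ack = 0
            return self.next_expected - 1
        return None
```

Here an ack value of N means "everything up to and including N has arrived", so the sender would take corrective action only if that ack fails to turn up.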

<Ricky>
I was presuming sequence ordering to be part of reliable 
messaging.  Seems like you consider this a separate layer.
</Ricky>


>The "time expiry" is unreliable because clocks may be out of sync.
><DB>Absolutely right.
>
>The "cheap", but as you say inaccurate way to do this is to set and compare
>"expires at" using a local system clock. The fact that it is an
>approximation to the true time is often not a big issue especially if you
>are doing end-to-end acks where the time between sending a message and when
>it expires is long compared to the clock accuracy (e.g. a day). Even so, it
>is probably good practice that Reliable Messaging solutions take this
>uncertainty in the accuracy of the time into account and extend the
>"expires at" to some time beyond the nominal expiry time.
>
>If time accuracy is *fairly* critical, then the sender and receiver of a
>message SHOULD agree to keep their clocks accurate using, for example,
>protocols such as the Network Time Protocol. If accuracy is *really*
>critical then you can include in the message the accuracy to which the
>system at the destination MUST keep its clocks. If the system does not keep
>its clocks accurate or cannot keep them accurate enough, then the
>destination should reject the message and not process it.</DB>
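
As a rough illustration of the two ideas above (extending "expires at" by the clock uncertainty, and rejecting a message when the destination cannot meet a stated clock accuracy), here is a minimal Python sketch. The function names and parameters are hypothetical, and how a system measures its own clock accuracy is left as an assumption.

```python
from datetime import datetime, timedelta, timezone


def sender_may_give_up(expires_at, now, clock_uncertainty):
    """Sender side: treat the message as expired only once the nominal
    expiry time plus the assumed clock uncertainty has passed."""
    return now > expires_at + clock_uncertainty


def receiver_accepts(expires_at, now, required_accuracy, actual_accuracy):
    """Destination side: reject the message if the local clock is not kept
    as accurate as the message demands, or if the message has expired."""
    if actual_accuracy > required_accuracy:
        return False  # clock not accurate enough: reject, do not process
    return now <= expires_at
```

For example, with a five-minute uncertainty the sender keeps treating the message as live until five minutes past the nominal expiry, while the destination compares against the nominal expiry on its own clock.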

<Ricky>
Maybe I misunderstood the purpose of the expiration time.  I guess the purpose 
of your time expiry mechanism is to reduce the "in-doubt" condition.  Say A 
sends a message to B which is valid for T minutes, and A doesn't receive 
an ACK from B.  A keeps resending but still doesn't get back the ACK 
after (T+10) minutes.  Can A at this point simply give up and conclude 
that the message is undelivered?  All I am trying to say is that "A cannot draw 
that conclusion".  Sorry, I agree this is unrelated to the clock sync 
problem.
So I see the expiration time as purely an application-level semantic (e.g. 
you send a bid response which is valid for one day).  I don't see what 
role the expiration time plays at the RM level.  I must be missing something 
here.
</Ricky>
<DB2>I think the expiration time is important for RM, and here's why.

All RM is based on waiting for an ack and repeatedly resending the original
message if you don't get one. The problem is when do you stop resending and
give up. At some point you have to stop, but when? There are two ways of
doing this.
1. Stop after a fixed number of retries (which is what ebXML MS does), or
2. Stop only after the message can no longer validly be processed.

The problem with the first approach is that if you say stop sending after 3
retries (i.e. 4 sends of the message in total), then it is still quite
possible that, if the destination system was down, and then came back up it
could pick up the message and process it - this is quite normal behavior.
You could then get into the situation where:
1. The sender sent the message but gave up resending after, say, 20 minutes as
no reply was received, then
2. The sender reports to the application that delivery failed
3. The destination restarts, finds the message, sends the ack and starts
processing it.
4. The sender receives the ack and has to tell the application that the
message for which it had just reported a delivery failure had actually been
received and was processed - not a desirable outcome

Alternatively by specifying a time out and basing the retries on that you
know, with a high degree of certainty, that even if the message is picked
up, it won't be processed and therefore you are much less likely to have to
report a delivery failure and then have to reverse it.

The question is how do you decide what the timeout value should be. There
are again two ways of doing this:
1. Use a value that is driven by the application - i.e. it is a business
value, or
2. Use a value that is determined based on speed of the transport protocol
and therefore how long you expect it to take to get the ack.

I think that it is always a good idea to use a business driven value if one
is available, but it really is an implementation decision.
</DB2>
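
The second approach (keep resending until the message can no longer validly be processed) might look roughly like the Python sketch below. It is only an illustration under stated assumptions: `send`, `wait_for_ack` and `clock` are hypothetical callables supplied by the caller, and times are plain numbers in seconds.

```python
def send_reliably(send, wait_for_ack, expires_at, retry_interval, clock):
    """Resend until an ack arrives or the message has expired.

    send()                -- transmits the message once (hypothetical)
    wait_for_ack(timeout) -- blocks up to `timeout` seconds, True if acked
    clock()               -- returns the current time in seconds
    """
    attempts = 0
    while clock() < expires_at:
        send()
        attempts += 1
        # Never wait past the expiry time itself.
        remaining = max(expires_at - clock(), 0)
        if wait_for_ack(min(retry_interval, remaining)):
            return ("delivered", attempts)
    # Only now is failure reported; an expired message should not be
    # processed even if the destination picks it up later.
    return ("expired", attempts)
```

The number of retries is not fixed in advance; it falls out of the expiry time and the retry interval, whether that expiry is a business value or derived from the transport.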

>I don't think there should be a step 4 in LEVEL 3.  Step 3 should say "Have
>you receive the message ?  If not, forget the message afterwards"
><DB>I don't think you can always say this. For example if you want to place
>an order and there is only one supplier, then even if your message failed,
>you might want to resend it if the connection became available later. In
>this case, the content/payload/body of the message might be identical but in
>other ways it was a completely new message.</DB>

<Ricky>
What I'm trying to prevent is the situation where the request message 
arrives at the receiver after the query (so the receiver responds: "I haven't 
got it"), but before the "forget message" gets there.  In this case, the 
message has been delivered, but the sender thinks it hasn't.

Going back to your example, you should send a query to the supplier "Have 
you received my purchase order with message id=12345 ?  If you haven't, 
ignore that message if it arrives later".
If you get back an answer "NO", resend the same purchase order with a new 
message id=98765.

However, if you send a separate "forget" message after you receive a 
"NO", then it is possible that the receiver gets 2 purchase orders (one with 
message id = 12345 and the other with id = 98765).
</Ricky>
<DB2>There is actually a little mistake in Level 3 as I described it;
correcting it avoids the problem you describe. Basically you only attempt a
recovery *after* you have given up, and you only give up when the message has
expired. In this case, even if the message arrived after the query, it
should be rejected as it arrived too late.</DB2>


>I think LEVEL 5 should be done at the transaction layer, below
>choreography, but above reliable messaging.  Using some 2-phase-interaction
>style like BTP.
><DB>Quite possibly. The problem with two phase commit is the action you take
>when you get a failure (i.e. a rollback) may not always be the right one and
>often it can be impossible to do. For example, if you want to roll back a
>payment, but the payment has already gone to the bank, then it's too late. You
>have to do a reversal, or refund instead. Both of these would leave a trace
>in the records of what happened.</DB>

<Ricky>
Of course, you can always handle exceptions at the application level, which 
can recover from a partial failure in a very application-specific manner.  
However, this can complicate the application flow because it mixes the 
normal flow with exception handling logic for different failure scenarios.

The beauty of transaction processing is that the application can encapsulate 
multiple activities within a transaction block and safely assume everything 
will automatically be undone on failure.  In other words, the application 
doesn't need to worry about all failure situations.

Let's look at a simple case where A is sending a "money transfer request" to 
B, which sends a "money deposit request" to C as well as another "money 
withdrawal request" to D.  Let me illustrate the flow based on a 2-phase 
handshake.

1) A sends "transfer" to B, and waits for "Prepared-ACK-transfer" from B
2) B sends "deposit" to C, and waits for "Prepared-ACK-deposit" from C
3) B sends "withdrawal" to D and waits for "Prepared-ACK-withdrawal" from D
4) After B gets back all the "Prepared-ACK" from C and D, it sends back the 
"Prepared-ACK-transfer" to A

5) A sends "commit" to B, and waits for "Committed-ACK-transfer" from B
6) B sends "commit" to C, and waits for "Committed-ACK-deposit" from C
7) B sends "commit" to D and waits for "Committed-ACK-withdrawal" from D
8) After B gets back all the "Committed-ACK" from C and D, it sends back the 
"Committed-ACK-transfer" to A

</Ricky>
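
The eight steps above can be sketched in a few lines of Python. This is only an in-process illustration, not BTP or any real protocol: the `Participant` class, the `can_prepare` flag, and the `cancel` path are assumptions added for the example (the flow above only covers the success case), and the sketch assumes every participant is able to undo its prepared work.

```python
class Participant:
    """Hypothetical participant in a two-phase handshake."""

    def __init__(self, name, children=(), can_prepare=True):
        self.name = name
        self.children = list(children)
        self.can_prepare = can_prepare
        self.state = "idle"

    def prepare(self):
        """Phase 1: answer 'prepared' only if all children are prepared too."""
        if not self.can_prepare:
            self.state = "refused"
            return False
        # all() stops at the first child that refuses.
        if not all(child.prepare() for child in self.children):
            self.state = "refused"
            return False
        self.state = "prepared"
        return True

    def commit(self):
        """Phase 2: commit children first, then report committed."""
        for child in self.children:
            child.commit()
        self.state = "committed"

    def cancel(self):
        """Assumed failure path: undo anything that was prepared."""
        for child in self.children:
            child.cancel()
        if self.state == "prepared":
            self.state = "cancelled"


def transfer(b):
    """A's side: commit only after B reports that everything is prepared."""
    if b.prepare():
        b.commit()
        return True
    b.cancel()
    return False
```

With `b = Participant("B", [Participant("C"), Participant("D")])`, calling `transfer(b)` walks through steps 1 to 8; if D is created with `can_prepare=False`, C's prepared deposit is cancelled instead.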
<DB2>What you describe in this example is a Business Process. It is NOT, in
my opinion, reliable messaging as you make the return of one ack dependent
on the receipt of two other acks. The bottom line is that you can only do
transaction processing if you KNOW that complete rollback of the state at
the sender and receiver is possible. Sometimes it is, and sometimes it isn't
which is why you have to determine how you do the recovery at the
application level.</DB2>

By the way, you have raised some very good points David !

Best regards,
Ricky
Received on Saturday, 14 December 2002 14:17:37 UTC