RE: Proposed text on reliability in the web services architecture from Assaf Arkin on 2003-01-16 (www-ws-arch@w3.org from January 2003)

From: Assaf Arkin <arkin@intalio.com>
Date: Thu, 16 Jan 2003 13:43:24 -0800
To: "Walden Mathews" <waldenm@optonline.net>, "Peter Furniss" <peter.furniss@choreology.com>, "Champion, Mike" <Mike.Champion@SoftwareAG-USA.com>, <www-ws-arch@w3.org>
Message-ID: <IGEJLEPAJBPHKACOOKHNGEMNDAAA.arkin@intalio.com>
> True that RM plays no part in selecting an incremental strategy for
> setting end-state.  What I'm saying is that RM can bolster such an
> approach toward one definition of reliability, and that a different
> approach in the application can attack reliability end-to-end, more
> effectively.
>
> Do we agree that HTTP over TCP/IP already has RM incorporated?

Let me explain something about RM and you tell me what you think.

Scenario 1: I have an application that sends a message for asynchronous
delivery. The message gets queued. From the queue it is delivered to an
intermediary. The intermediary queues it and delivers it to its destination.
The destination queues it and then processes it.

Scenario 2: I send a message for immediate acceptance by its destination.

In scenario 1 all the connections and queues are 100% reliable (a dream but
let me dream). The message gets delivered to the intermediary fine, but the
destination is currently offline. The intermediary relies on TCP for
offering RM. It doesn't attempt to re-deliver the message, it just discards
the message. The message never gets delivered.

In scenario 2 I decide that I get all my RM needs from TCP. So I do have RM.
I just don't do any additional work beyond what TCP provides.

There's actually more to RM than what TCP provides, so there are valid
cases where I need more than TCP even in scenario 2. (Like ordering of
messages from the client, ordering of messages in multicast, etc.) But I'm
just trying to point out that you need RM, and in some cases you get all
your needs met by TCP and in some cases you don't, and in some cases you
want to build the application so that regardless of who gives you the RM, it
looks the same (decouple the app from the transit mechanism).

I think the real value of an RM specification is if it lets me decouple the
app from the transit mechanism, so whether I use scenario 1 or scenario 2, I
still develop my app the same way.
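
To make the decoupling concrete, here is a minimal sketch in Python. It is
not taken from any RM specification: ReliableMessenger, DirectMessenger,
QueuedMessenger, place_order and the connect callable are all hypothetical
names made up for illustration. The point is only that the application
function is written once and works unchanged in both scenarios.

```python
from abc import ABC, abstractmethod


class DeliveryError(Exception):
    """Raised when the RM layer cannot deliver a message."""


class ReliableMessenger(ABC):
    """Hypothetical RM interface; the application codes only against this."""

    @abstractmethod
    def send(self, destination: str, payload: bytes) -> None:
        ...


class DirectMessenger(ReliableMessenger):
    """Scenario 2: immediate delivery, relying on the connection (e.g. TCP)."""

    def __init__(self, connect):
        # connect is an assumed callable: destination -> deliver(payload)
        self._connect = connect

    def send(self, destination, payload):
        try:
            self._connect(destination)(payload)
        except OSError as exc:
            raise DeliveryError(str(exc)) from exc


class QueuedMessenger(ReliableMessenger):
    """Scenario 1: queue the message; flush() forwards it later."""

    def __init__(self, connect):
        self._connect = connect
        self.pending = []

    def send(self, destination, payload):
        self.pending.append((destination, payload))

    def flush(self):
        still_pending = []
        for destination, payload in self.pending:
            try:
                self._connect(destination)(payload)
            except OSError:
                # Destination offline: keep the message instead of discarding it.
                still_pending.append((destination, payload))
        self.pending = still_pending


def place_order(rm: ReliableMessenger, order: bytes) -> None:
    """Application code: identical no matter which messenger is plugged in."""
    rm.send("orders", order)
```

Note that the queued variant keeps a message when the destination is
offline, which is exactly the extra work the intermediary in scenario 1
failed to do.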


> I disagree about this case.  Clients recognize audit as part of their
> scope, and the journal is a tool in that, hence its visibility.  Moreover,
> the client maintains its own version of the journal.  You can decouple
> implementation details of client and server, but in this case, the journal
> is not an implementation detail.

That's assuming the client audits what the server audits, which is not
always the case. Even if you use Quicken to balance your checkbook, the bank
keeps much more information in its journal than you keep in Quicken.

> Maybe I misunderstood you.  I thought you were versioning the
> repository (the account), but actually you're versioning the messages?
> If you're trying the same operation but with a higher version number,
> where is "sameness" recorded?  Versioning seems irrelevant.  Maybe
> I still don't understand.

I am actually versioning the repository (the account), but under the
following requirements:

- Update x with version v can occur only if the current version v' is less
than v (but no need for v=v'+1)
- If update x fails due to versioning, get a new version number and current
state and try again
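
The two requirements above can be sketched as follows. This is only an
illustration under assumptions: Account, VersionConflict, deposit and the
next_version callable (whatever hands out version numbers) are hypothetical
names, not part of any real system discussed here.

```python
class VersionConflict(Exception):
    """The update carried a version that is not greater than the current one."""


class Account:
    def __init__(self, balance=0):
        self.balance = balance
        self.version = 0

    def read(self):
        """Return the current version v' and the current state."""
        return self.version, self.balance

    def update(self, version, new_balance):
        # Accept update x with version v only if v' < v.
        # Gaps are fine: v does not have to be v' + 1.
        if not self.version < version:
            raise VersionConflict(
                f"version {version} is not above current {self.version}"
            )
        self.balance = new_balance
        self.version = version


def deposit(account, amount, next_version, max_attempts=5):
    """Retry loop: on a version failure, re-read state and get a new version."""
    for _ in range(max_attempts):
        current, balance = account.read()
        try:
            account.update(next_version(current), balance + amount)
            return
        except VersionConflict:
            continue  # fetch a fresh version number and current state
    raise VersionConflict("gave up after repeated version conflicts")
```

Because each retry recomputes the new balance from freshly read state, a
retried update never applies a stale balance.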


> This depends on the user's (bank or customer) requirement, which will
> vary from user to user, and cannot always be covered by a statistical
> approach to reliability.  Hence my statement above that "not all problems
> are modeled well by statistics".  Some aren't, and still have to be
> accounted for.  RM alone is not a good fit for those, would you agree?

It's a statistical approach to business. As long as you might have a bug
there (and there's always a bug hidden there somewhere), even with 100%
reliability of messaging/coordination you will get incorrect balances every
so often. So you need to deal with that possibility.


> My point is that you need to do something 100% of the time, even if
> only 10% of messages are lost.  Job 1 is identifying the lost messages,
> which is O(n), where n = number of messages.  No?

If message m precedes message m', but message m' does not depend on message
m, and you lose message m then life goes on.

If message m precedes message m', and message m' can only be processed after
message m, then you know which message is missing.

One way to order messages is to always say m' follows m, and identify where
m is coming from, so you can ask for redelivery of m.

There are other strategies; it all depends on how you want to order
messages.
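
As a rough sketch of the "m' follows m" strategy: each message names its
predecessor and its source, so the receiver knows exactly which message is
missing and whom to ask. Message, OrderedReceiver and the
request_redelivery callback are hypothetical names used only for this
illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    msg_id: str
    follows: Optional[str]  # id of the message this one must follow, if any
    source: str             # where the predecessor can be re-requested from
    body: str


class OrderedReceiver:
    def __init__(self, request_redelivery):
        # request_redelivery(source, msg_id) is an assumed callback
        self._request = request_redelivery
        self._processed = set()
        self._held = {}      # predecessor id -> message waiting for it
        self.delivered = []  # bodies in the order they were processed

    def receive(self, msg):
        if msg.msg_id in self._processed:
            return  # duplicate of something already processed
        if msg.follows is not None and msg.follows not in self._processed:
            # The predecessor is missing, and we know which one it is.
            self._held[msg.follows] = msg
            self._request(msg.source, msg.follows)
            return
        self._process(msg)

    def _process(self, msg):
        self._processed.add(msg.msg_id)
        self.delivered.append(msg.body)
        # Release the message that was waiting for this one, if any.
        waiting = self._held.pop(msg.msg_id, None)
        if waiting is not None:
            self._process(waiting)
```

A message with no predecessor (follows is None) is processed immediately,
which matches the independent case where a lost message does not block
anything.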


> I'm sure your experiment was real.  The dream I'm talking about is the
> one in which this somehow relieves the application of its burden.  Perhaps
> you don't think it does.

Let's go back to TCP. TCP offers RM. Does TCP make all the interaction 100%
reliable?

My application deals with errors. It has to, because there is no guarantee
that errors will not occur.

But it is decoupled from the layer that deals with reliable messaging. So
when I use TCP, the TCP layer is sufficient. And when I use UDP, I have a
UDP RM. And when I use IP multicast, I have an IP multicast RM. In all three
cases my application still has to deal with errors, and is totally
decoupled from the RM implementation.

Now, there is such a thing as false errors. For my application, if a message
was not delivered at first but the RM can redeliver it, that's a false error.
If the application can avoid dealing with false errors, it will work faster,
because the RM can optimize redelivery.

In this particular case, if messages 1..n sent by different applications on
the same machine do not get delivered, the RM may redeliver them in bulk,
which is faster than the applications can deal with them individually. In
fact, the redelivery of all messages occurs well before the application even
detects that one message was lost (50ms compared to 5s).

You can construct a few tests and you will see that RM - whether it's TCP
RM, or RM for something other than TCP, or RM over TCP because queues are
used - is a more efficient strategy than having the application deal with
false errors. But you will also find that even if the RM is 100% perfect, it
doesn't alleviate the need to deal with errors.
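
The split between false errors and real ones can be sketched like this. It
is a minimal illustration, not a real RM implementation: send_reliably,
DeliveryFailed and the transmit callable are hypothetical, and the retry
count is an arbitrary assumption.

```python
class DeliveryFailed(Exception):
    """A real error: the RM layer exhausted its redelivery attempts."""


def send_reliably(transmit, message, max_attempts=3):
    """Hypothetical RM layer: absorb transient failures by redelivering.

    transmit is an assumed callable that raises ConnectionError when a
    delivery attempt fails.
    """
    last_error = None
    for _ in range(max_attempts):
        try:
            return transmit(message)
        except ConnectionError as exc:
            # A false error so far: the RM can still redeliver, so the
            # application never hears about it.
            last_error = exc
    # Redelivery did not help. This is the kind of error the application
    # still has to handle, however good the RM is.
    raise DeliveryFailed(f"gave up after {max_attempts} attempts") from last_error
```

The application calls send_reliably and handles only DeliveryFailed; the
transient failures stay inside the RM layer.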

I think we both agree that RM, as its name says, addresses reliability of
message delivery. It does not address other aspects of reliability. It's not
a cure-all, it's a "make this part of the system better".

arkin

>
> Walden
Received on Thursday, 16 January 2003 16:45:02 UTC