Re: Proposed text on reliability in the web services architecture from Walden Mathews on 2003-01-17 (www-ws-arch@w3.org from January 2003)

From: Walden Mathews <waldenm@optonline.net>
Date: Fri, 17 Jan 2003 10:15:36 -0500
To: Assaf Arkin <arkin@intalio.com>, Peter Furniss <peter.furniss@choreology.com>, "Champion, Mike" <Mike.Champion@SoftwareAG-USA.com>, www-ws-arch@w3.org
Message-id: <004f01c2be3b$4acd1f60$1702a8c0@WorkGroup>
Arkin,

> > True that RM plays no part in selecting an incremental strategy for
> > setting end-state.  What I'm saying is that RM can bolster such an
> > approach toward one definition of reliability, and that a different
> > approach in the application can attack reliability end-to-end, more
> > effectively.
> >
> > Do we agree that HTTP over TCP/IP already has RM incorporated?
>
> Let me explain something about RM and you tell me what you think.
>
> Scenario 1: I have an application that sends a message for asynchronous
> delivery. The message gets queued. From the queue it is delivered to an
> intermediary. The intermediary queues it and delivers it to its
> destination. The destination queues it and then processes it.
>
> Scenario 2: I send a message for immediate acceptance by its destination.
>
> In scenario 1 all the connections and queues are 100% reliable (a dream
> but let me dream). The message gets delivered to the intermediary fine,
> but the destination is currently offline. The intermediary relies on TCP
> for offering RM. It doesn't attempt to re-deliver the message, it just
> discards the message. The message never gets delivered.
>
> In scenario 2 I decide that I get all my RM needs from TCP. So I do have
> RM. I just don't do any additional work from what TCP provides.
>
> There's actually more to RM than what TCP provides, so there are valid
> cases where I need more than TCP even in scenario 2. (Like ordering of
> messages from the client, ordering of messages in multicast, etc) But I'm
> just trying to point out that you need RM, and in some cases you get all
> your needs met by TCP and in some cases you don't, and in some cases you
> want to build the application so regardless of who gives you the RM, it
> looks the same (decouple the app from the transit mechanism).
>
> I think the real value of an RM specification is if it lets me decouple
> the app from the transit mechanism, so whether I use scenario 1 or
> scenario 2, I still develop my app the same way.

I think you may be drifting off the reliability subject and into uniform
interfaces.

From a reliability standpoint, I know of applications for which
having many intervening queues is not a good fit because it postpones
the immediate feedback the app is looking for, namely that the service
is offline right now.


> > I disagree about this case.  Clients recognize audit as part of their
> > scope, and the journal is a tool in that, hence its visibility.
> > Moreover, the client maintains its own version of the journal.  You can
> > decouple implementation details of client and server, but in this case,
> > the journal is not an implementation detail.
>
> That's assuming the client audits what the server audits, which is not
> always the case. Even if you use Quicken to balance your check, the bank
> keeps much more information in their journal than you keep in Quicken.

No, it's not assuming any such thing.  It's simply that clients and servers
both talk about ledgers and deposits and withdrawals, and that detail is
in the interface, not hidden in the implementation.

>
> > Maybe I misunderstood you.  I thought you were versioning the
> > repository (the account), but actually you're versioning the messages?
> > If you're trying the same operation but with a higher version number,
> > where is "sameness" recorded?  Versioning seems irrelevant.  Maybe
> > I still don't understand.
>
> I am actually versioning the repository (the account), but under the
> following requirements:
>
> - Update x with version v can occur only if the current version v' is less
> than v (but no need for v=v'+1)
> - If update x fails due to versioning, get a new version number and
>   current state and try again
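
The two quoted requirements can be sketched roughly as follows. This is only
an illustration, not anything from Arkin's system: the names Account,
VersionConflict and deposit are hypothetical, and "get a new version number"
is assumed here to mean "current version plus one".

```python
class VersionConflict(Exception):
    """Raised when an update carries a version that is not newer than the stored one."""

    def __init__(self, current_version, current_state):
        super().__init__(f"stale version, current is {current_version}")
        self.current_version = current_version
        self.current_state = current_state


class Account:
    """Hypothetical versioned repository holding a single balance."""

    def __init__(self, balance=0, version=0):
        self.balance = balance
        self.version = version

    def update(self, new_balance, version):
        # Requirement 1: accept only if the current version v' is less than v.
        # Gaps are allowed; v does not have to equal v' + 1.
        if version <= self.version:
            raise VersionConflict(self.version, self.balance)
        self.balance = new_balance
        self.version = version


def deposit(account, amount, seen_version, seen_balance, max_attempts=3):
    """Apply a deposit computed from the caller's last known view of the account."""
    for _ in range(max_attempts):
        try:
            # Assumption: the next version number is simply seen_version + 1.
            account.update(seen_balance + amount, seen_version + 1)
            return account.version
        except VersionConflict as conflict:
            # Requirement 2: fetch the current version and state, then try again.
            seen_version = conflict.current_version
            seen_balance = conflict.current_state
    raise RuntimeError("gave up after repeated version conflicts")
```

Note that nothing in this sketch records whether a rejected update is a
repeat of one already applied, which is the "sameness" question raised in
the quoted text above.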

I think we're sidetracked on this, but if you feel there's a point to be
made about RM, go ahead.

>
>
> > This depends on the user's (bank or customer) requirement, which will
> > vary from user to user, and cannot always be covered by a statistical
> > approach to reliability.  Hence my statement above that "not all
> > problems are modeled well by statistics".  Some aren't, and still have
> > to be accounted for.  RM alone is not a good fit for those, would you
> > agree?
>
> Statistical approach to business. As long as you might have a bug there
> (and there's always a bug hidden there somewhere), even with 100%
> reliability of messaging/coordination you will get incorrect balances
> every so often. So you need to deal with that possibility.

Right, client implementations have to take that stance to be reliable.

>
>
> > My point is that you need to do something 100% of the time, even if
> > only 10% of messages are lost.  Job 1 is identifying the lost messages,
> > which is O(n), where n = number of messages.  No?
>
> If message m precedes message m', but message m' does not depend on
> message m, and you lose message m then life goes on.
>
> If message m precedes message m', and message m' can only be processed
> after message m, then you know which message is missing.
>
> One way to order messages is to always say m' follows m, and identify
> where m is coming from, so you can ask for redelivery of m.
>
> There are other strategies, it all depends on how you want to order
> messages.

Some applications depend on receiving (at some point) all the messages
sent to them, not just the more recent ones.  For example, a stock ticker
store-and-forward system ("ticker plant" where I come from) needs all
the ticks, even though it gets summaries (idempotent) periodically.
Whether it is successfully receiving 99% or 90% or 50%, it has to implement
the same recovery strategy, which involves knowing which messages are
received and which are not.
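
To make "knowing which messages are received and which are not" concrete,
here is a minimal sketch. It assumes every message carries a consecutive
integer sequence number, which is my assumption for illustration and not a
description of any particular ticker plant; GapTracker and its methods are
hypothetical names.

```python
class GapTracker:
    """Tracks which sequence numbers have arrived and which are still outstanding."""

    def __init__(self, first_seq=1):
        self.next_expected = first_seq  # lowest sequence number not yet seen or flagged
        self.missing = set()            # sequence numbers skipped over so far

    def receive(self, seq):
        # This bookkeeping runs for every message, whatever the loss rate is.
        if seq >= self.next_expected:
            # Everything between the expected number and this one was skipped.
            self.missing.update(range(self.next_expected, seq))
            self.next_expected = seq + 1
        else:
            # A late arrival or a redelivery fills a known gap; duplicates are no-ops.
            self.missing.discard(seq)

    def resend_requests(self):
        """Collapse the missing set into inclusive (first, last) ranges."""
        ranges = []
        for seq in sorted(self.missing):
            if ranges and seq == ranges[-1][1] + 1:
                ranges[-1] = (ranges[-1][0], seq)
            else:
                ranges.append((seq, seq))
        return ranges


tracker = GapTracker()
for seq in (1, 2, 3, 4, 11, 12):
    tracker.receive(seq)
print(tracker.resend_requests())  # [(5, 10)]
```

The receiver does the same work per message whether it is getting 99% or
50% of the feed; only the size of the resend list changes.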

>
> > I'm sure your experiment was real.  The dream I'm talking about is the
> > one in which this somehow relieves the application of its burden.
> > Perhaps you don't think it does.
>
> Let's go back to TCP. TCP offers RM. Does TCP make all the interaction
> 100% reliable?
>
> My application deals with errors. It has to because there is no guarantee
> that errors would not occur.

So then your application requests retransmissions.  How can you
call that "decoupled from the layer that deals with reliable messaging"?

>
> But, it is decoupled from the layer that deals with reliable messaging. So
> when I use TCP, the TCP layer is sufficient. And when I use UDP, I have a
> UDP RM. And when I use IP multicast, I have an IP multicast RM. In all
> three cases my application still has to deal with errors, and is totally
> decoupled from the RM implementation.

In TCP applications I've written, the more the application tries to take
advantage of TCP, the more it embroils itself in internal TCP states.
I think this "total decoupling" is a total fallacy.  Or a dream, whichever.

>
> Now, there is such a thing as false errors. For my application, if a
> message was not delivered first but the RM can redeliver it, that's a
> false error. If it can avoid dealing with false errors it would work
> faster. It would work faster because the RM can optimize redelivery.

I don't like the "false error" designation.  It's an error that's been
recovered.  There are certain types of dataflow in which an infrastructure
can optimize this, because it views the flow at finer granularity.  This
is not the case for the examples we've been discussing.

>
> In this particular case, if messages 1..n sent by different applications
> on the same machine do not get delivered, the RM may redeliver them in
> bulk, which is faster than the application can deal with it. In fact, the
> redelivery for all messages occurs way before the application even detects
> that one message was lost (50ms compared to 5s).

Come on, the RM layer is introducing latency.  An application is just
as able to say "resend messages 5 through 10" as is the RM layer.  This
is not where the optimization comes from.

>
> You can construct a few tests and you will see that RM - whether it's TCP
> RM, or RM for something other than TCP, or RM over TCP because queues are
> used - is a more efficient strategy than having the application deal with
> false errors. But you will also find that even if the RM is 100% perfect,
> it doesn't alleviate the need to deal with errors.

Streams are an effective optimization for certain kinds of dataflow, but
let's keep the tradeoffs in mind too.  It's not a free lunch.

>
> I think we both agree that RM, as its name says, addresses reliability of
> message delivery. It does not address other aspects of reliability. It's
> not a cure-all, it's a "make this part of the system better".

It optimizes, in other words.  Sometimes.  And for it to fit the arbitrary
application's needs well, it needs to be tailored or configured carefully,
which means the application developers still have to know about the
network as it pertains to their application, in detail.  It's Waldo all over
again.

Walden
Received on Friday, 17 January 2003 10:15:41 UTC