RE: Proposed text on reliability in the web services architecture from Assaf Arkin on 2003-01-17 (www-ws-arch@w3.org from January 2003)

From: Assaf Arkin <arkin@intalio.com>
Date: Fri, 17 Jan 2003 12:04:16 -0800
To: "Walden Mathews" <waldenm@optonline.net>, "Peter Furniss" <peter.furniss@choreology.com>, "Champion, Mike" <Mike.Champion@SoftwareAG-USA.com>, <www-ws-arch@w3.org>
Message-ID: <IGEJLEPAJBPHKACOOKHNAEOBDAAA.arkin@intalio.com>
> I think you may be drifting off the reliability subject and into uniform
> interfaces.

I think we all agree uniform interfaces are a great thing. Or I could
rephrase that and say we agree that abstracting the delivery mechanisms is a
good thing, so whether I'm doing HTTP with TCP reliability or SMTP with
something-else reliability, it all looks the same.

> From a reliability standpoint, I know of applications for
> which many intervening queues is not a good fit because it postpones
> the immediate feedback the app is looking for, namely that the service
> is offline right now.

I would say that's a QoS issue. You don't want to circumvent the queue if
the only way to reliably process your request is to use a queue. But you
want to know if it can be processed. You don't want to send a message and
get a reply 24 hours later saying, "sorry I'm too busy for the next two
days, please come again". You want to send a synchronous request (what ebXML
calls a ping) to ask "can you process?" and then send the request.
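
A minimal sketch of that probe-then-send pattern, in Python. The `Service`
class and the `can_process`, `enqueue` and `submit` names are hypothetical,
made up for illustration and not taken from ebXML or any other
specification. It assumes the probe is answered synchronously and bypasses
the queue, while the request itself goes through the queue.

```python
class Service:
    """Hypothetical stand-in for a remote service fronted by a queue."""

    def __init__(self, accepting):
        self.accepting = accepting  # whether the service can take work now
        self.queue = []             # requests waiting to be processed

    def can_process(self):
        """Synchronous probe: answered immediately, bypassing the queue."""
        return self.accepting

    def enqueue(self, request):
        """Asynchronous path: the request waits in the queue."""
        self.queue.append(request)


def submit(service, request):
    """Probe first; queue the request only if the service says it can cope."""
    if not service.can_process():
        return False                # immediate feedback, nothing was queued
    service.enqueue(request)
    return True


print(submit(Service(accepting=True), "order-1"))
print(submit(Service(accepting=False), "order-2"))
```

The probe only narrows the window: the service could still become busy
between the probe and the request, so the queue remains the thing that makes
processing reliable.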


> No, it's not assuming any such thing.  It's simply that clients
> and servers
> both talk about ledgers and deposits and withdrawals, and that detail is
> in the interface, not hidden in the implementation.

I've done stock trading applications, and I can certainly tell you that both
client and server need to keep a ledger and the server ledger is 10x more
complex. So maybe the client can send 10% of the information to the server,
but that's still part of the server-side state change.


> I think we're sidetracked on this, but if you feel there's a point to be
> made about RM, go ahead.

We are getting sidetracked. I was actually making a point about a form of
communication that has RM underneath, not the RM itself.


> Some applications depend on receiving (at some point) all the messages
> sent to them, not just the more recent ones.  For example, a stock ticker
> store-and-forward system ("ticker plant" where I come from) needs all
> the ticks, even though it gets summaries (idempotent) periodically.
> Whether it is successfully receiving 99% or 90% or 50%, it has to
> implement
> the same recovery strategy, which involves knowing which messages are
> received and which are not.

Yep. That's why I said there are multiple strategies, it all depends on what
you want to achieve. You may ask for best-effort, which means you either
receive it or not. You may want to always get the message, so a loss requires
a resend. You may even depend (actually most applications do) on strict
ordering, so if X was sent before Y you need X before you can process Y.


> So then your application requests retransmissions.  How can you
> call that "decoupled from the layer that deals with reliable messaging"?

My application doesn't request retransmission. When you use TCP and a packet
gets lost, do you request retransmission? You let the RM layer deal with it.
What you have elected to do is use a messaging layer that guarantees all
messages within a sequence be delivered in order, and the RM deals with
that.

Now, if I send/receive multiple messages then TCP solves the packet problem
by making sure each message is either full or ignored, but it doesn't solve
the multi-message sequence problem. So you need something on top of TCP
whenever multiple messages are involved.


> In TCP applications I've written, the more the application tries to take
> advantage of TCP, the more it embroils itself in internal TCP states.
> I think this "total decoupling" is total fallacy.  Or a dream, whichever.

I'm not sure I understand. You can't hide complexity but you can abstract
it. So the RM solution is first about abstracting it then about decreasing
it.


> I don't like the "false error" designation.  It's an error that's been
> recovered.  There are certain types of dataflow in which an infrastructure
> can optimize this, because it views the flow at finer granularity.  This
> is not the case for the examples we've been discussing.

The term false error refers to an error in one layer that could be dealt
with more effectively at another layer.

For example, if a packet gets lost when you do TCP you don't hear about it.
TCP cleans it up by asking for retransmission, holding on to the packets
already received, and then delivering them all in order.

It's called a false error because TCP only needs to request a resend of the
packet. On the other hand, if the application had to deal with it, the
application would have to request a resend of the remainder of the message
since it doesn't get access to the queued packets. Having the application
deal with it would be less efficient.

By the time the application gets to send the resend request the packet may
have already arrived and TCP could have delivered all the queued packets. So
in a sense you are wasting time dealing with a situation that TCP could
recover from more easily. That's why it's called a "false error" (my term,
though I'm borrowing it from a different domain in communication that uses
it in pretty much the same way).

WS RM would do the same thing except for a sequence of messages.
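
To make the comparison concrete, here is a minimal sketch in Python. It is
an illustration only, not taken from any WS RM specification, and the
`Resequencer` class and the two `resend_cost_*` helpers are hypothetical
names. It assumes messages carry consecutive sequence numbers starting at 1
and that exactly one message in the middle of the sequence is lost. The
ordering layer asks only for the gap, while an application that never sees
the queued messages asks for everything from the gap onward.

```python
class Resequencer:
    """Hypothetical ordering layer: buffers out-of-order messages and
    releases them to the application strictly in sequence."""

    def __init__(self):
        self.next_seq = 1   # next sequence number owed to the application
        self.pending = {}   # seq -> payload, received but not yet deliverable

    def receive(self, seq, payload):
        """Accept one message; return the payloads now deliverable in order."""
        if seq < self.next_seq or seq in self.pending:
            return []       # duplicate: already delivered or already queued
        self.pending[seq] = payload
        delivered = []
        while self.next_seq in self.pending:
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return delivered

    def missing(self):
        """Sequence numbers the layer would ask the sender to resend."""
        if not self.pending:
            return []
        return [s for s in range(self.next_seq, max(self.pending))
                if s not in self.pending]


def resend_cost_rm_layer(total, lost):
    """Messages resent when the ordering layer recovers the loss itself."""
    rm = Resequencer()
    for seq in range(1, total + 1):
        if seq != lost:
            rm.receive(seq, f"msg-{seq}")
    return len(rm.missing())


def resend_cost_application(total, lost):
    """Messages resent when the application, which cannot see the queued
    messages, asks for everything from the lost one onward."""
    return total - lost + 1


print(resend_cost_rm_layer(10, 5))     # only the gap is requested
print(resend_cost_application(10, 5))  # the whole remainder is requested
```

A loss at the very end of the sequence is not detectable from the buffer
alone; a real layer would need a timeout or an end-of-sequence marker for
that case, which this sketch leaves out.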


> Come on, the RM layer is introducing latency.  An application is just
> as able to say "resend messages 5 through 10" as is the RM layer.  This
> is not where the optimization comes from.

See above for why not. If you lose message X but you already have message Y
queued, all you need is to request message X again. You queue Y because X
and Y are ordered and have to be delivered in order. You can have the
application deal with it. Fine. You're doing RM at the application level.
But whatever layer does the ordering is the most efficient layer to deal
with retransmission. And whatever layer does the ordering is the RM, whether
you get that as part of a WS stack or write it as part of the application.

The definition of reliable is (very loosely and not precisely, so don't kill
me on the wording): you deliver the message exactly once (deliver to the
application, you can send/receive it multiple times though), you deliver a
message only if the message was actually sent (no spurious messages), you
deliver the message in some designated order.
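
Those three properties can be written down as a check over a log of what was
sent and what was delivered to the application. This is a rough sketch in
Python; `check_delivery` is a hypothetical helper, and it assumes message
identifiers are unique and that the designated order is simply the order of
sending.

```python
from collections import Counter


def check_delivery(sent, delivered):
    """Return the names of the violated properties (empty list = reliable).

    sent:      message ids in the order they were sent (assumed unique)
    delivered: message ids in the order the application received them
    """
    violations = []
    counts = Counter(delivered)
    sent_set = set(sent)

    # Exactly once: every sent message reaches the application one time.
    if any(counts[m] != 1 for m in sent):
        violations.append("exactly-once")

    # No spurious messages: nothing is delivered that was never sent.
    if any(m not in sent_set for m in delivered):
        violations.append("no-spurious")

    # Designated order: delivered messages follow the order of sending.
    position = {m: i for i, m in enumerate(sent)}
    seen = [position[m] for m in delivered if m in sent_set]
    if seen != sorted(seen):
        violations.append("ordered")

    return violations


print(check_delivery(["a", "b", "c"], ["a", "b", "c"]))
print(check_delivery(["a", "b"], ["b", "a"]))
```

Note that the check looks only at what reaches the application; how many
times a message crosses the wire is deliberately not part of it.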


> Streams are an effective optimization for certain kinds of dataflow, but
> let's keep the tradeoffs in mind too.  It's not free lunch.

You are right.


> > I think we both agree that RM, as it names says, addresses
> reliability of
> > message delivery. It does not address other aspects of reliability. It's
> not
> > a cure-all, it's a "make this part of the system better".
>
> It optimizes, in other words.  Sometimes.  And for it to fit the arbitrary
> application's needs well, it needs to be tailored or configured carefully,
> which means the application developers still have to know about the
> network as it pertains to their application, in detail.  It's
> Waldo all over
> again.

The only two questions I have are:

1. Do you need RM?
2. Do you benefit from abstracting it?

I think that in some situations I can get a better implementation if I elect
to use an RM. It's not a free lunch, but it buys me the ability to
send/process messages asynchronously which is not an "always good thing" but
is often something I would like to do.

I also think that abstracting is a good thing because I can get that layer
componentized, I can reuse it across multiple applications, or even buy it
from a 3rd party, or download it as open source. But to be abstract we need
some generalized API (or even two or three).

I can get that from any MOM, and I have an abstract API I can use (JMS), but
I think it's better if I also have some protocol that MOM X could use to
talk to MOM Y, and that I could use to talk to the MOM as an intermediary
Web service, so I can have a consistent way to define that API as a Web
service.

arkin

>
> Walden
Received on Friday, 17 January 2003 15:05:51 UTC