RE: Proposed text on reliability in the web services architecture from Assaf Arkin on 2003-01-26 (www-ws-arch@w3.org from January 2003)

From: Assaf Arkin <arkin@intalio.com>
Date: Sun, 26 Jan 2003 13:45:52 -0800
To: "Walden Mathews" <waldenm@optonline.net>, "Miles Sabin" <miles@milessabin.com>, <www-ws-arch@w3.org>
Message-ID: <IGEJLEPAJBPHKACOOKHNMEIEDBAA.arkin@intalio.com>
> > That's a bad proposition. I would like to receive at some point (say 8
> > hours later) a message confirming whether the delivery would be made or
> > not. That's how I achieve reliability of the application, and I cannot
> > think of any other way.
>
> I'd rather interact with an application that can tell me immediately
> (i.e., fast enough for synchronous exchange) what the state of supply is,
> or a piece of meta-supply-state that means "not sure if we can fulfill".

I would rather interact with an application that delivers what I want within
a specified time frame. When I go to service my car I always ask for pricey
high quality parts. My mechanic is not a warehouse; they don't stock
everything. Sometimes they have it, sometimes the parts shop next door has
it, sometimes they have to order it from one of their numerous suppliers.
When I set an appointment I give them time to call their suppliers and
decide whether they can service me tomorrow with that part, or service me
tomorrow with some other part and the next day with the part I want. I know
they can always service me next month with that part. Because I give them a
few hours to determine when they can get the part, I get both the part and
the best service. But as they say, your mileage may vary.

I think the big divide here is that I have worked for companies that had
outstanding suppliers in time to delivery, quality of products, not messing
up shipments and not overcharging. And these suppliers use asynchronous
messaging, so even if they go home to sleep at 5pm, you can send a request
at 4am and get it addressed the next business day. And that worked well for
both parties. I understand your impatience, but for most people waiting a
few hours to get back a reply about when delivery will happen seems
acceptable.


> Remember
> this RM stuff is supposed to be in service of real and robust business
> applications.  That being the case, it does no good to assume badly
> designed applications as the basis (requirements) for RM features.

On the contrary. I am routinely pointing out that reliable applications use
a variety of coordination protocols and that RM plays an important part in
many of these protocols. If you have a badly designed application you have a
badly designed application.


> Note that your paragraph above can be summarized by saying that
> reliability of the application is a matter of State Transfer.  True?

Definitely not. I would rather think of reliability as being the likelihood
of something (hardware, software, process) continuing to function over a
given period of time under the specified conditions.

How you address reliability is a different issue, and state transfer is one
of the concepts you could use to address reliability. But state transfer is
not application reliability just like RM is not application reliability.


> > In a non-perfect world messages may be lost. The fact that a message has
> > been lost means I will have to wait 8 hours to determine that. This is a
> > lousy failure detection algorithm. Why wait 8 hours?
>
> Note that this problem has been "swept" by the more important
> problem above, and its solution.  Let's optimize and solve this problem
> only once, at the application level.  Do you favor optimization?

Let's put it another way. You send an e-mail to this mailing list. That
e-mail goes through three hops to get here. One of the nodes is offline.
Your e-mail gets discarded. I assume you are fine with that. I would much
prefer that, if one node is down, my e-mail simply be routed through a
different node. I don't care which path it takes as long as it gets there.
I am asking my e-mail server to do RM.

Note that I haven't talked about resend, timeouts, etc. I just ask that it
delivers. Which approach you use is up to you, but I would rather use SMTP
than UDP to send my e-mails. Which one would you choose?


> > Let's say I do synchronous delivery using TCP. I start sending the
> > message and near the end the TCP connection drops. I can say "fine, I
> > think the message got there", or I can say "oh, oh, message loss". In
> > the first case I would wait 8 hours to determine whether the message was
> > delivered/processed. In the second case I am more responsive, I can
> > react immediately by opening another connection and sending it again. I
> > am doing RM.
>
> Looking back, the 8 hours of your use case above is the time allowed
> for the service application to asynchronously advise of product supply
> state.  This, therefore, is apples and oranges again, and even your
> 8 hours assumption above fails, I think, because the supply application
> may not know about you at all.

I am assuming some common sense here. Either the supplier doesn't have to
know about me, or the supplier has to and does know about me. However, I can
inform the supplier who I am exactly once and then keep sending purchase
orders routinely. And I assume the supplier could use the return address to
tell me the status of the order, "I don't know who you are", or "I know who
you are but prefer not to sell you anything, thank you very much, please
don't come back".

I don't think I'm inventing anything here, just reflecting on how I've seen
businesses work.


> > Now let's say the receiver decides not to process messages as they come,
> > instead it queues them for later processing. The queue is not persisted.
> > If the TCP connection drops the message never gets to the queue. It will
> > not be delivered, so there's no ack. The sender needs to retry again. If
> > the message gets into the queue it's acked. It will possibly be
> > delivered.
>
> I think by "receiver" here you mean the supply application, not some
> server side of the RM machine.  But you're talking about messages being
> delivered and (I think) products being delivered, and it's hard to tell
> which is which.

By receiver I mean the server side of the RM machine. In RM we distinguish
between:

1. Sending a message (the act of creating and firing a message, as opposed
to sending it over the wire)
2. Receiving a message (the act of getting a sent message, as opposed to
receiving it over the wire)
3. Delivering a message (the act of forwarding the received message to the
application)

A sender sends each message exactly once. It could be sent multiple times
over the wire, e.g. for resending, but from the perspective of RM it is sent
exactly once. How you resend is protocol specific (some implementations
resend on demand, some use timeouts, some just keep resending all the time).

A receiver may receive a message multiple times and in any order. That
allows any medium to be used, including media that duplicate messages. The
medium stays simple if it knows nothing about the message and can't detect
duplication; the RM does (since it does the sending and receiving) and can
remove duplicates.

A receiver delivers the message exactly once, so the application can be
built with the assumption that each message would be delivered exactly once.
RM is conceptual, so if you build that logic into your software your
software combines application and RM responsibilities.
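
To make the receive/deliver distinction concrete, here is a minimal sketch
in Python. It is only an illustration under assumptions of mine: the names
Message, Receiver, msg_id and on_wire are hypothetical, and I assume the
sender stamps each message with a unique ID exactly once, so every copy that
travels over the wire carries the same ID.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Message:
    msg_id: str  # hypothetical unique ID, assigned once by the sender
    body: str


@dataclass
class Receiver:
    """Receiving side of RM: may receive duplicates, delivers each ID once."""

    deliver: Callable[[Message], None]      # hands a message to the application
    seen: set = field(default_factory=set)  # IDs that were already delivered

    def on_wire(self, msg: Message) -> bool:
        """Called for every copy that arrives over the wire.

        Always returns True (an ack) so the sender can stop resending,
        but forwards the message to the application only the first time.
        """
        if msg.msg_id not in self.seen:
            self.seen.add(msg.msg_id)
            self.deliver(msg)
        return True


delivered = []
receiver = Receiver(deliver=delivered.append)
order = Message("po-1", "purchase order")

# The medium duplicates the first message; the application sees it once.
for copy in (order, Message("po-2", "second order"), order):
    receiver.on_wire(copy)

print([m.msg_id for m in delivered])  # ['po-1', 'po-2']
```

The application behind deliver never has to think about duplicates. A real
receiver would also have to bound or persist the set of seen IDs, which this
sketch ignores.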

RM simply says that if a message is sent then any "correct process" will
eventually deliver it. To qualify, "eventually" doesn't mean "an indefinite
period of time", though it may sound like it. If the message expires in 5
minutes then "eventually" means within 5 minutes; if the message cannot be
delivered in 5 minutes the process is no longer correct.

If it doesn't deliver, the process is not correct either. In other words, if
I send a message for delivery within 5 minutes and don't get an ack, I assume
the process is incorrect and did not deliver. Any coordination protocol
takes that into account in building a reliable application solution.
Timeouts are thus the primary means for fault detection.
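
As a rough sketch of timeout-based fault detection, assuming a fixed resend
interval (just one of the resend strategies mentioned earlier) and
hypothetical names throughout (send_reliably, transmit, FakeClock), the
sending side could look like this in Python. The clock is injected so the
example runs instantly instead of waiting five real minutes.

```python
from typing import Callable


def send_reliably(
    transmit: Callable[[], bool],  # one attempt over the wire; True if acked
    expires_in: float,             # message lifetime in seconds
    resend_every: float,           # hypothetical fixed resend interval
    clock: Callable[[], float],
    sleep: Callable[[float], None],
) -> bool:
    """Return True if acked before expiry, False if the peer is presumed faulty."""
    deadline = clock() + expires_in
    while clock() < deadline:
        if transmit():
            return True
        # Wait for the next resend, but never past the expiry deadline.
        sleep(min(resend_every, max(0.0, deadline - clock())))
    return False


class FakeClock:
    """Deterministic stand-in for time.monotonic / time.sleep."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


# Case 1: the first two copies are lost, the third one is acked.
clock = FakeClock()
attempts = []

def flaky() -> bool:
    attempts.append(clock.time())
    return len(attempts) == 3

acked = send_reliably(flaky, 300, 60, clock.time, clock.sleep)

# Case 2: the receiver never acks; after 5 minutes we treat it as incorrect.
clock2 = FakeClock()
gave_up = not send_reliably(lambda: False, 300, 60, clock2.time, clock2.sleep)

print(acked, attempts, gave_up, clock2.now)
```

From the caller's point of view the message was sent once; the resends stay
inside the function. A False result is what a coordination protocol would
treat as a detected fault.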


> If a supply application receives my order for goods but does not
> store it safely, then we are once again talking about a brain dead
> application, and no amount of RM is going to fix that.  Let's avoid
> talking about brain dead applications, okay?

Agreed.


> > The sender cannot distinguish between a message that was not delivered
> > and a message that was not processed. So for the sender the fact that
> > the message has arrived at its destination fully intact warrants an ack.
>
> I think you're saying that the sending (requesting) application wants
> an acknowledgment that a message was delivered.  I think you're wrong
> about what it wants.  It wants a state transfer.  An end-to-end thingy.

I am saying that acknowledgment of delivery could occur way before state
transfer. You probably sent me this e-mail expecting a reply within 5
minutes. I would assume the reply is the "state transfer".

What if I just went to see a movie?

Now let's say you had two ways of sending me a message. You could do an HTTP
operation, but then I'll have to be online. So you need to keep doing that
until I come back online, which could be the middle of the night in NYC, or
you could just give up.

You could also use SMTP. Send & forget knowing it will get to me and I will
read it when I go online and reply to you. What if the message gets lost?
You could wait five days, look at all the people who never replied to you
and resend. Or you can just let the SMTP server handle that, since it has
node-to-node nacks (what is not nacked in a given time frame is by default
acked, not entirely reliable but better than nothing).

Which option would you choose to continue this conversation?


> I want to point out that it's quite feasible for applications -- clients
> and servers -- to conduct their business asynchronously while at the
> same time communicating synchronously.  Or else what are telephones
> all about?  "I'll get back to you on that" is a synchronous reply
> signalling a business decision to postpone part of the business
> process.  Has there been an assumption that asynchrony in business
> process implies asynchrony in communication protocol?  Maybe
> we need to decouple there.

Ever heard of voice mail? Faxes? Pagers? BlackBerry?

If synchronous communication works so well, why bother with voice mail?
Maybe it's the "I don't want to keep calling you every five minutes until
you get out of a meeting I don't know about, I'll just leave you a voice
mail".

Most large scale enterprise systems (and all the ones I know of) use
asynchronous communication at various points. Not exclusively. But it would
be great if we had a solution that works the way people do business. I am
not saying people should always use voice mail, I am just saying voice mail
should be an option. I would hate to think how you could run a business
without voice mail.


> > The receiver can employ two strategies to improve its QoS. The receiver
> > can either make sure it never fails, or it can persist messages. Which
> > strategy it uses is up to the receiver. But statistically the one that
> > chooses persistence is going to give a higher QoS and those remain in
> > business longer. Queuing is optional just like friendly customer support
> > is optional.
>
> In other words, to be reliable, a service must preserve state so that
> it can later transfer it, state transfer being the equivalent to
> end-to-end communication.  If by "queueing" you mean persistence of
> state, then I find your last sentence above curious.  It seems to say
> that application reliability is optional.  In the context of this
> discussion, it shouldn't be.

By queuing I mean persistence of message. If you can't process immediately
you can either hold everything in memory (assuming software never crashes)
or store it in a queue. You stand in line at the bank to make a non-ATM
transaction. The teller in front of you all of a sudden needs to go. They
don't just say "everyone in this line, please go out, come back in". They
route the line to the next available teller. That's queuing 101. Queuing
helps you build fault tolerant systems because eventually all messages get
delivered (an RM property).
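
For what it's worth, here is a small Python sketch of a persisted queue,
assuming SQLite (from the standard library) as the store. The class name
PersistentQueue and the table layout are made up for illustration. The
point is only that a message is acked after it has been committed to disk,
so a restart in between does not lose it.

```python
import os
import sqlite3
import tempfile


class PersistentQueue:
    """FIFO queue whose contents survive a process restart."""

    def __init__(self, path: str) -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS queue "
            "(seq INTEGER PRIMARY KEY AUTOINCREMENT, body TEXT NOT NULL)"
        )
        self.db.commit()

    def enqueue(self, body: str) -> bool:
        """Store the message durably, then ack by returning True."""
        self.db.execute("INSERT INTO queue (body) VALUES (?)", (body,))
        self.db.commit()
        return True

    def dequeue(self):
        """Remove and return the oldest message, or None if the queue is empty."""
        row = self.db.execute(
            "SELECT seq, body FROM queue ORDER BY seq LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        self.db.execute("DELETE FROM queue WHERE seq = ?", (row[0],))
        self.db.commit()
        return row[1]

    def close(self) -> None:
        self.db.close()


path = os.path.join(tempfile.mkdtemp(), "queue.db")

q = PersistentQueue(path)
q.enqueue("deposit 100")
q.enqueue("withdraw 20")
q.close()  # the teller leaves: simulate the process going away

q2 = PersistentQueue(path)  # the next teller picks up the same line
first = q2.dequeue()
second = q2.dequeue()
empty = q2.dequeue()
q2.close()

print(first, second, empty)  # deposit 100 withdraw 20 None
```

Note that this sketch deletes a message as soon as it is dequeued. A more
careful version would delete it only after the application has finished
processing it.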


> Application reliability (reasonably designated above) is the real
> requirement.  As a developer of web services, I'd rather find that
> subject* treated directly in the architecture document than find a
> section on "RM", because the latter is not a full substitute for the
> former, and because its depth, complexity and challenge are a
> distraction from my real goal.  Summary: I think the focus on RM
> will diminish application reliability instead of fostering it because
> developers will tend not to believe that such a complex undertaking
> is not a full solution.

I agree. I think it is very important for the WS arch document to discuss
application reliability and, separately, messaging.

I do not claim that RM solves any application reliability problem per se.
But from the perspective of WS, the solution involves messaging. WS doesn't
talk about database integrity, database logging, exception catching, or the
million other things you need there to get reliability. It talks about the
messages that services exchange as part of a coordinated message exchange
that strives for reliability.

Granted, these coordinations would often utilize RM as a way to build a
better coordination protocol, with, and let me repeat that again, RM doing
its part to address delivery of messages.

So the WS arch document needs to first identify that such coordination is a
requirement and should be addressed, then point out that such coordination
may elect to use RM for addressing delivery issues, which would put RM in
the right context.

arkin

>
> * Web Service Reliability
>
> Walden
Received on Sunday, 26 January 2003 16:47:14 UTC