Re: Proposed text on reliability in the web services architecture

----- Original Message -----
From: "Assaf Arkin" <arkin@intalio.com>
To: "Miles Sabin" <miles@milessabin.com>; <www-ws-arch@w3.org>
Sent: Wednesday, January 22, 2003 5:52 PM
Subject: RE: Proposed text on reliability in the web services architecture


>
> Here's my view of things (abridged version).
>
> In a perfect world where messages are never lost, one way flows do not
> assure any level of reliability. If I send a request to buy products X/Y
> which I expect to be shipped within 2~3 weeks, the message gets processed
> but the products are not available (book out of print), I have to wait 3
> weeks to determine that I will not receive my product.

You're mixing apples and oranges here.  In a perfect world where
messages are never lost, message reliability is axiomatic and assumed.
But above where you'd be waiting 3 weeks because while messaging
was perfect, the supply chain wasn't, the thing that's missing is app
level coordination, not reliability of messaging.  Apples and oranges.
Applications and orangutans.  Whatever.

>
> That's a bad proposition. I would like to receive at some point (say 8 hours
> later) a message confirming whether the delivery would be made or not.
> That's how I achieve reliability of the application, and I cannot think of
> any other way.

I'd rather interact with an application that can tell me immediately (i.e.,
fast enough for synchronous exchange) what the state of supply is, or a
piece of meta-supply-state that means "not sure if we can fulfill".
Remember, this RM stuff is supposed to be in service of real and robust
business applications.  That being the case, it does no good to assume
badly designed applications as the basis (requirements) for RM features.

Note that your paragraph above can be summarized by saying that
reliability of the application is a matter of State Transfer.  True?

>
> In a non-perfect world messages may be lost. The fact that a message has
> been lost means I will have to wait 8 hours to determine that. This is a
> lousy failure detection algorithm. Why wait 8 hours?

Note that this problem has been "swept" by the more important
problem above, and its solution.  Let's optimize and solve this problem
only once, at the application level.  Do you favor optimization?

>
> Let's say I do synchronous delivery using TCP. I start sending the message
> and near the end the TCP connection drops. I can say "fine, I think the
> message got there", or I can say "oh, oh, message loss". In the first case I
> would wait 8 hours to determine whether the message was delivered/processed.
> In the second case I am more responsive, I can react immediately by opening
> another connection and sending it again. I am doing RM.

Looking back, the 8 hours of your use case above is the time allowed
for the service application to asynchronously advise of product supply
state.  This, therefore, is apples and oranges again, and even your
8 hours assumption above fails, I think, because the supply application
may not know about you at all.

>
> Now let's say I use queues. I put the message in a queue and I wait 8 hours
> for a response. The MOM picks the message from the queue, sends it, TCP
> connection drops, it says "oh, well, life goes on". I wait 8 hours and get no
> response. What if the MOM would simply retry to send the message again? The
> queue is fulfilling the RM responsibility.

Yes it is.
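
A minimal sketch of that retry behaviour, for concreteness.  The names
here (send_with_retry, FlakyLink, max_attempts) are hypothetical and are
not taken from any real MOM product; the link is simulated in memory.
Note the sketch does not deal with duplicates, which can occur when a
connection drops after the receiver already has the message.

```python
def send_with_retry(send, message, max_attempts=3):
    """Call send(message) until it succeeds or attempts run out.

    Returns the number of attempts used. The send callable is assumed
    to raise ConnectionError when the connection drops.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            send(message)
            return attempt
        except ConnectionError:
            continue  # connection dropped: try again on a new attempt
    raise ConnectionError(f"gave up after {max_attempts} attempts")


class FlakyLink:
    """In-memory stand-in for a connection that drops a few times."""

    def __init__(self, drops):
        self.drops = drops      # how many sends will fail before one succeeds
        self.delivered = []     # messages that actually arrived

    def send(self, message):
        if self.drops > 0:
            self.drops -= 1
            raise ConnectionError("connection dropped")
        self.delivered.append(message)


link = FlakyLink(drops=2)
attempts = send_with_retry(link.send, "order-1")
```

With two simulated drops, the third attempt succeeds and the sender
never has to wait 8 hours to learn about the loss.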

>
> Now let's say the receiver decides not to process messages as they come,
> instead it queues them for later processing. The queue is not persisted. If
> the TCP connection drops the message never gets to the queue. It will not be
> delivered, so there's no ack. The sender needs to retry again. If the
> message gets into the queue it's acked. It will possibly be delivered.

I think by "receiver" here you mean the supply application, not some
server side of the RM machine.  But you're talking about messages being
delivered and (I think) products being delivered, and it's hard to tell
which is which.

If a supply application receives my order for goods but does not
store it safely, then we are once again talking about a brain dead
application, and no amount of RM is going to fix that.  Let's avoid
talking about brain dead applications, okay?

If the application receives my order and makes its state retrievable,
then I can retrieve that state at any time.  This constitutes application
level reliability.  Any time I can't retrieve state because of underlying
comms breakage, I can distinguish that from bad application state
because it looks like a time-out, not a missing resource or a resource
in the wrong state.  While this doesn't mean goods will be delivered,
it means we know what's broken -- the best that can be accomplished
in the name of distributed applications reliability.
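
As a rough illustration of that distinction, here is a sketch in which
a client classifies the outcome of a state retrieval.  Everything named
here (Diagnosis, diagnose, fetch_state, the state strings) is an
assumption made up for the example; a time-out is modelled as Python's
built-in TimeoutError and a missing resource as KeyError.

```python
from enum import Enum


class Diagnosis(Enum):
    OK = "order is in the expected state"
    COMMS_FAILURE = "could not reach the service (time-out)"
    MISSING_RESOURCE = "service reachable, but it has no such order"
    WRONG_STATE = "service reachable, order in an unexpected state"


def diagnose(fetch_state, order_id, expected_state):
    """Retrieve the order state and say which layer, if any, is broken."""
    try:
        state = fetch_state(order_id)
    except TimeoutError:
        return Diagnosis.COMMS_FAILURE      # underlying comms breakage
    except KeyError:
        return Diagnosis.MISSING_RESOURCE   # the application lost the order
    if state != expected_state:
        return Diagnosis.WRONG_STATE        # the application has it, but wrong
    return Diagnosis.OK


# Hypothetical in-memory "supply application" holding one order.
orders = {"42": "accepted"}


def fetch_from_memory(order_id):
    return orders[order_id]


def fetch_over_dead_link(order_id):
    raise TimeoutError("no answer from supplier")


result = diagnose(fetch_from_memory, "42", "accepted")
```

Only the first failure case is a messaging problem; the other two are
application state, visible only because the state is retrievable.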

>
> The sender cannot distinguish between a message that was not delivered and a
> message that was not processed. So for the sender the fact that the message
> has arrived at its destination fully intact warrants an ack.

I think you're saying that the sending (requesting) application wants
an acknowledgment that a message was delivered.  I think you're wrong
about what it wants.  It wants a state transfer.  An end-to-end thingy.

>
> The receiver takes two hours before it can start processing the message.
> During the two hours it may crash, message is lost. This is equivalent to
> message not being processed for any other reason. But, it takes six more
> hours to find this out. So the receiver has a lousy QoS. I will elect not to
> do business with this supplier.

Would you please elect not to design RM systems for it also?  It's a
waste of brain cycles.  We're supposed to be fostering best practices.
(Okay, I already lectured on this above, so no more.)

I want to point out that it's quite feasible for applications -- clients
and servers -- to conduct their business asynchronously while at the
same time communicating synchronously.  Or else what are telephones
all about?  "I'll get back to you on that" is a synchronous reply
signalling a business decision to postpone part of the business
process.  Has there been an assumption that asynchrony in business
process implies asynchrony in communication protocol?  Maybe
we need to decouple there.
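
One way to picture the decoupling, as a sketch only: every call below
returns immediately (synchronous communication), yet the business
decision is made later (asynchronous process).  SupplyService and its
methods and state names are invented for illustration and do not come
from any specification.

```python
class SupplyService:
    """Toy supplier that answers at once but decides later."""

    def __init__(self):
        self._orders = {}   # order_id -> current business state
        self._next_id = 1

    def place_order(self, item):
        # Synchronous reply: "I'll get back to you on that."
        order_id = self._next_id
        self._next_id += 1
        self._orders[order_id] = "pending"
        return {"order_id": order_id, "item": item, "state": "pending"}

    def resolve(self, order_id, can_fulfill):
        # The postponed business decision, made whenever supply is known.
        self._orders[order_id] = "confirmed" if can_fulfill else "rejected"

    def get_state(self, order_id):
        # Synchronous state retrieval, available at any time.
        return self._orders[order_id]


service = SupplyService()
reply = service.place_order("out-of-print book")
service.resolve(reply["order_id"], can_fulfill=False)
final_state = service.get_state(reply["order_id"])
```

The client learns "pending" at once and "rejected" as soon as it asks
after the decision, rather than inferring it from silence.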

>
> The receiver can employ two strategies to improve its QoS. The receiver can
> either make sure it never fails, or it can persist messages. Which strategy
> it uses is up to the receiver. But statistically the one that chooses
> persistence is going to give a higher QoS and those remain in business
> longer. Queuing is optional just like friendly customer support is optional.

In other words, to be reliable, a service must preserve state so that
it can later transfer it, state transfer being the equivalent to end-to-end
communication.  If by "queueing" you mean persistence of state, then
I find your last sentence above curious.  It seems to say that application
reliability is optional.  In the context of this discussion, it shouldn't
be.

Application reliability (reasonably designated above) is the real
requirement.  As a developer of web services, I'd rather find that
subject* treated directly in the architecture document than find a
section on "RM", because the latter is not a full substitute for the
former, and because its depth, complexity and challenge are a
distraction from my real goal.  Summary: I think the focus on RM
will diminish application reliability instead of fostering it, because
developers will tend to assume that such a complex undertaking must
be a full solution.

* Web Service Reliability

Walden

Received on Sunday, 26 January 2003 11:47:54 UTC