RE: Proposed text on reliability in the web services architecture from Assaf Arkin on 2003-01-22 (www-ws-arch@w3.org from January 2003)

From: Assaf Arkin <arkin@intalio.com>
Date: Wed, 22 Jan 2003 14:52:45 -0800
To: "Miles Sabin" <miles@milessabin.com>, <www-ws-arch@w3.org>
Message-ID: <IGEJLEPAJBPHKACOOKHNOEEGDBAA.arkin@intalio.com>
Here's my view of things (abridged version).

Even in a perfect world where messages are never lost, one-way flows do not
assure any level of reliability. Say I send a request to buy products X/Y,
which I expect to be shipped within 2~3 weeks. The message gets processed,
but the products are not available (book out of print). I have to wait 3
weeks to determine that I will not receive my product.

That's a bad proposition. I would like to receive at some point (say 8 hours
later) a message confirming whether the delivery would be made or not.
That's how I achieve reliability of the application, and I cannot think of
any other way.

In a non-perfect world messages may be lost. If a message has been lost, I
will have to wait 8 hours to find that out. This is a lousy failure
detection algorithm. Why wait 8 hours?

Let's say I do synchronous delivery using TCP. I start sending the message
and near the end the TCP connection drops. I can say "fine, I think the
message got there", or I can say "oh, oh, message loss". In the first case I
would wait 8 hours to determine whether the message was delivered/processed.
In the second case I am more responsive: I can react immediately by opening
another connection and sending it again. I am doing RM.
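
A minimal Python sketch of that second reaction, treating a dropped
connection as a possible loss and resending at once. The names
send_with_retry, send and max_attempts are hypothetical, and send stands
in for whatever actually opens the connection and writes the message:

```python
def send_with_retry(send, message, max_attempts=3):
    """Call send(message) until it succeeds or the attempts run out.

    `send` is an assumed callable that raises ConnectionError when the
    connection drops. Returns the number of attempts used and re-raises
    the last ConnectionError if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            send(message)
            return attempt
        except ConnectionError:
            # Assume the message was lost and try again immediately,
            # instead of waiting hours for a reply that may never come.
            if attempt == max_attempts:
                raise
```

A resent message may arrive twice if the first copy did in fact get
through, so the receiving side has to tolerate duplicates.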

Now let's say I use queues. I put the message in a queue and I wait 8 hours
for a response. The MOM picks the message from the queue, sends it, the TCP
connection drops, and it says "oh, well, life goes on". I wait 8 hours and
get no response. What if the MOM simply tried to send the message again?
Then the queue is fulfilling the RM responsibility.

Now let's say the receiver decides not to process messages as they come,
instead it queues them for later processing. The queue is not persisted. If
the TCP connection drops the message never gets to the queue. It will not be
delivered, so there's no ack. The sender needs to retry. If the
message gets into the queue it's acked. It will possibly be delivered.

The sender cannot distinguish between a message that was not delivered and a
message that was not processed. So for the sender the fact that the message
has arrived at its destination fully intact warrants an ack.
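
A small Python sketch of a receiver that acks on intact arrival rather
than on processing. QueueingReceiver and the declared_length check are
assumptions made for illustration; any integrity check would do:

```python
from collections import deque


class QueueingReceiver:
    """Queues messages for later processing; the queue is in memory only."""

    def __init__(self):
        self.queue = deque()

    def receive(self, payload: bytes, declared_length: int) -> bool:
        """Return True (ack) only if the whole message arrived and was queued."""
        if len(payload) != declared_length:
            # Truncated, e.g. the connection dropped mid-message: no ack,
            # so the sender knows it has to send again.
            return False
        self.queue.append(payload)
        # The ack says "arrived intact", not "processed".
        return True
```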

The receiver takes two hours before it can start processing the message.
During the two hours it may crash, and the message is lost. This is
equivalent to the message not being processed for any other reason. But it
takes six more hours to find this out. So the receiver has a lousy QoS. I
will elect not to
do business with this supplier.

The receiver can employ two strategies to improve its QoS. The receiver can
either make sure it never fails, or it can persist messages. Which strategy
it uses is up to the receiver. But statistically the ones that choose
persistence are going to give a higher QoS, and those remain in business
longer. Queuing is optional just like friendly customer support is optional.
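
A rough Python sketch of the persistence strategy, assuming a simple
append-only file as the store. PersistentQueue, its file layout and the
recover method are hypothetical; a real receiver might use a database or
the MOM's own storage:

```python
import json
import os


class PersistentQueue:
    """Writes each message to disk before it is acknowledged."""

    def __init__(self, path):
        self.path = path

    def enqueue(self, message: str) -> bool:
        """Append the message to the file and return True once it is on disk."""
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(message) + "\n")
            f.flush()
            os.fsync(f.fileno())  # ask the OS to flush to stable storage
        return True  # only now is it safe to ack

    def recover(self):
        """Reload every stored message, e.g. after the receiver restarts."""
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]
```

A receiver that crashes during the two hours can call recover() on
restart and still process what it acked.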

Conclusion:

- RM doesn't say anything about processing of messages, only delivery

- An application can implement RM itself, an application can also implement
queuing itself, an application can also implement SOAP encoding/decoding
itself

- Many applications would benefit if they can use someone else's queue, SOAP
encoding/decoding etc (in this case the gateway)

- The queue could be much better if it uses acks to resend potentially lost
messages and also uses persistence to protect against failure

- The queue doesn't have to do that, but a framework that incorporates acks
and lets it persist messages is beneficial to all parties

- My proposal is only to allow this layer to exist through an abstract
interface which allows the application to exert some control (e.g. try
once/do your best, only deliver within 8 hours) and allows the layer to
elect whichever strategy works best (depending on protocol) and allows two
RMs to exchange acks to *improve* overall reliability
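
One way such an abstract interface might look, sketched in Python. All
names here (DeliveryPolicy, ReliableMessaging, RetryingLayer, transport)
are hypothetical and not taken from any specification. The application
states its limits in the policy and the layer picks the strategy:

```python
import time
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class DeliveryPolicy:
    """Control the application exerts over the layer."""
    max_attempts: Optional[int] = 1           # 1 = try once; None = no limit
    deadline_seconds: Optional[float] = None  # e.g. 8 * 3600; None = no limit


class ReliableMessaging(ABC):
    @abstractmethod
    def send(self, message: bytes, policy: DeliveryPolicy) -> bool:
        """Return True if delivery was acknowledged within the policy."""


class RetryingLayer(ReliableMessaging):
    """One possible strategy: resend until the peer acks or a limit is hit."""

    def __init__(self, transport: Callable[[bytes], bool]):
        # Assumed callable: returns True when the peer RM acks the message.
        self.transport = transport

    def send(self, message: bytes, policy: DeliveryPolicy) -> bool:
        start = time.monotonic()
        attempts = 0
        # With both limits set to None this loops until an ack arrives.
        while True:
            if policy.max_attempts is not None and attempts >= policy.max_attempts:
                return False
            if (policy.deadline_seconds is not None
                    and time.monotonic() - start > policy.deadline_seconds):
                return False
            attempts += 1
            if self.transport(message):
                return True
```

Another layer could implement the same send() with persistence or a
different protocol without the application changing.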

arkin


> -----Original Message-----
> From: www-ws-arch-request@w3.org [mailto:www-ws-arch-request@w3.org]On
> Behalf Of Miles Sabin
> Sent: Wednesday, January 22, 2003 5:54 AM
> To: www-ws-arch@w3.org
> Subject: Re: Proposed text on reliability in the web services
> architecture
>
>
>
> Assaf Arkin wrote,
> > Miles Sabin wrote,
> > > So there's a gap between the parties which are making the visible
> > > commitments (the WS adapters) and the parties which are ultimately
> > > responsible for meeting them (the endpoints). Whether that gap is
> > > narrow and/or easily bridged, or an all consuming abyss is likely
> > > to vary on a case-by-case basis. I'm sure many of us on this list
> > > have experienced both.
> >
> > You have to decide what is the service and what is the application.
> > If you have a message handler there that allows your application to
> > receive messages over HTTP, the message handler is not the service.
> > It's a proxy that takes care of the HTTP/SOAP/etc details on behalf
> > of the actual service.
>
> That's the ideal, certainly.
>
> But the reality is that this is often very hard to do. In a not
> completely implausible scenario we might have, say, seven largely
> independent organizations involved: the legacy system vendor, the two
> sites which deploy that system, two consultancies providing the WS
> gateways (one at each site), each using a WS toolkit from a different
> WS tool vendor.
>
> In such circumstances clarity on the boundary between service and
> application is going to take a lot of work. If differences of opinion
> or outlook, or miscommunication, show through in the protocol or the
> way the protocol is used, then RM is likely to be the least of anyone's
> worries.
>
>
>
> Cheers,
>
>
> Miles
Received on Wednesday, 22 January 2003 17:53:56 UTC