Re: Distributed Systems and failure management

An even stronger statement in the same vein from Ken
Arnold [no relation, though we were sometimes
taken for brothers!]. As you can imagine, I feel
we have plenty of work to do in the area of reliable
messaging!
----
Failure is the defining difference between distributed and local
programming, so you have to design distributed systems with the
expectation of failure. Imagine asking people, "If the probability of
something happening is one in ten to the thirteenth, how often would it
happen?" Your natural human sense would be to answer, "Never." That is
an infinitely large number in human terms. But if you ask a physicist,
she would say, "All the time. In a cubic foot of air, those things
happen all the time." When you design distributed systems, you have to
say, "Failure happens all the time." So when you design, you design for
failure. It is your number one concern.

Received on Wednesday, 25 September 2002 18:10:43 UTC