Outline of a Betfair use case

Importantly, there is a combination of push logic and pull logic
involved. I'll make the use case clearer and post it on the wiki in a
formal format.

A Manager node is responsible for holding housekeeping information about
various servers playing different roles. When a server fails to send a
heartbeat for a specified amount of time, the Manager assumes that the
server has failed and cooperates with the Agent component running on an
unloaded node to resurrect it. A typical rule for receiving and updating
the latest heartbeat, in event notification style, would look like this:
rcvMsg(XID,Protocol,FromIP,inform,heartbeat(Role,RemoteTime)) :-
	time(LocalTime),
	update(key(FromIP,Role),
	       heartbeats(FromIP,Role,RemoteTime,LocalTime)).
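As an aside, the sending side is not shown here; it could be a simple
time-driven rule on each managed server. The following is only a rough
sketch in the same style (my own illustration: role/1 and manager/1 are
assumed local facts naming this server's role and the Manager's address,
and a trivially true event part is assumed to be accepted by the engine):
eca(	time( every('1S') ),
	event( true ),
	action( send_heartbeat ) ).
% read the local role, Manager address and clock, then push the heartbeat
send_heartbeat :-
	role(Role),
	manager(ManagerIP),
	time(RemoteTime),
	sendMsg(XID,Protocol,ManagerIP,inform,heartbeat(Role,RemoteTime)).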
The heartbeat-receiving rule responds to a message pattern matching the
one specified in its rcvMsg arguments. XID is the correlation-id of the
incoming message; inform is called a performative, representing the
semantic type of the message, in this case a one-way information passing
between parties; heartbeat(...) is the payload of the message. The body
of the rule enquires about the current local time and updates the record
containing the latest heartbeat from the controller. This rule follows a
push pattern, where the event is pushed towards the rule system and the
latter reacts. A pull-based ECA rule, activated every second by the rule
engine, detects server failures: for each server that has failed to send
a heartbeat within the last second, it initiates failover to the first
available unloaded server. The accompanying derivation rules detect and
respond are used for the specific purposes of detecting the failure and
organising the response.
eca(	time( every('1S') ),
	event( detect(controller_failure(IP,Role,'1S')) ),
	action( respond(controller_failure(IP,Role,'1S')) ) ).
detect(controller_failure(IP,Role,Timeout)) :-
	time(LocalTimeNow),
	heartbeats(IP,Role,RemoteTime,LocalTime),
	LocalTimeNow-LocalTime > Timeout.
respond(controller_failure(IP,Role,Timeout)) :-
	time(LocalTime),
	first(holdsAt(status(Server,unloaded),LocalTime)),
	update(key(Server),happens(loading(Server),LocalTime)),
	sendMsg(XID,loopback,self,initiate,failover(Role,IP,Server)).
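The failover message is posted back to the engine itself via the
loopback protocol, so a companion reaction rule would pick it up. As a
sketch only (my own illustration; the request performative and the
start(...) payload understood by the Agent on the spare server are
assumptions):
rcvMsg(XID,loopback,self,initiate,failover(Role,FailedIP,Server)) :-
	% ask the Agent on the unloaded server to take over the failed role
	sendMsg(XID,Protocol,Server,request,start(Role,FailedIP)).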
The ECA logic involves possible backtracking, so that all failed
components will be resurrected. The state of each server is managed via
an event calculus formulation:
initiates(loading(Server),status(Server,loaded),T).
terminates(unloading(Server),status(Server,loaded),T).
initiates(unloading(Server),status(Server,unloaded),T).
terminates(loading(Server),status(Server,unloaded),T).
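These initiates/terminates facts are evaluated against the happens(...)
records written by the respond rule. The evaluation axioms are not shown
here; in the standard simple event calculus they would look roughly as
follows (a sketch, assuming discrete, comparable time points):
% a fluent holds at T if some earlier event initiated it
% and no event terminated it in between
holdsAt(Fluent,T) :-
	happens(Event,T1), T1 =< T,
	initiates(Event,Fluent,T1),
	not(clipped(T1,Fluent,T)).
% clipped: the fluent was terminated somewhere in the interval
clipped(T1,Fluent,T2) :-
	happens(Event,T), T1 =< T, T < T2,
	terminates(Event,Fluent,T).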
The actual state of each server is derived from the loading and
unloading events that have happened, and is used in the ECA rule to
detect the first server which is in state "unloaded". This EC-based
formalisation can easily be extended, e.g. with new states such as a
maintenance state, which terminates an unloaded state but is not allowed
in case a server is already loaded:
initiates(maintaining(Server),status(Server,maintenance),T) :-
	not(holdsAt(status(Server,loaded),T)).
terminates(maintaining(Server),status(Server,unloaded),T).
Due to space restrictions we cannot show further extensions. However, as
can already be seen from these initial examples, higher-level decision
logics, such as SLA contract rules defining quality-of-service policies,
e.g. average availability levels and penalty payments in case these
service levels cannot be met, might easily be built upon this basic set
of failover-handling rules using further ECA, EC and event notification
rules.
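To give a flavour, such an SLA rule might, as a sketch only
(availability/3 and sla_level/3 are assumed helpers, not part of the
rules above), look like:
% derive a penalty whenever measured availability over some window
% drops below the contracted service level
violation(sla(Server,Penalty)) :-
	availability(Server,Window,Avail),
	sla_level(Server,MinAvail,Penalty),
	Avail < MinAvail.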