- From: Alex Kozlenkov <alex.kozlenkov@betfair.com>
- Date: Thu, 8 Jun 2006 13:13:44 +0100
- To: <public-rif-wg@w3.org>
Importantly, there is a combination of push logic and pull logic involved. I'll make the UC clearer and post it on the WIKI in a formal format. A Manager node is responsible for holding housekeeping information about the various servers playing different roles. When a server fails to send a heartbeat for a specified amount of time, the Manager assumes that the server has failed and cooperates with the Agent component running on an unloaded node to resurrect it.

A typical rule for receiving and updating the latest heartbeat in event-notification style would look like this:

    rcvMsg(XID,Protocol,FromIP,inform,heartbeat(Role,RemoteTime)) :-
        time(LocalTime),
        update(key(FromIP,Role),heartbeats(FromIP,Role,RemoteTime,LocalTime)).

The rule responds to a message matching the pattern specified in the rcvMsg arguments. XID is the correlation-id of the incoming message; inform is called a performative, representing the semantic type of the message, in this case a one-way piece of information passed between parties; heartbeat(...) is the payload of the message. The body of the rule enquires about the current local time and updates the record containing the latest heartbeat from the controller. This rule follows a push pattern: the event is pushed towards the rule system, and the latter reacts.

A pull-based ECA rule, activated every second by the rule engine, detects server failures: for each server that has failed to send a heartbeat within the last second, it responds by initiating failover to the first available unloaded server. The accompanying derivation rules detect and respond serve the specific purposes of detecting the failure and organising the response:

    eca( time( every('1S') ),
         event( detect(controller_failure(IP,Role,'1S')) ),
         action( respond(controller_failure(IP,Role,'1S')) ) ).

    detect(controller_failure(IP,Role,Timeout)) :-
        time(LocalTimeNow),
        heartbeats(IP,Role,RemoteTime,LocalTime),
        LocalTimeNow - LocalTime > Timeout.
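To illustrate the push/pull split outside the rule engine, here is a minimal Python sketch (all names hypothetical, not part of the actual system): a handler that records the latest heartbeat per (ip, role) key, and a periodic scan that flags entries older than the timeout, mirroring the rcvMsg and detect rules.

```python
import time

# Hypothetical in-memory stand-in for the heartbeats(...) facts:
# latest (remote_time, local_time) per (ip, role) key.
heartbeats = {}

def rcv_heartbeat(from_ip, role, remote_time):
    """Push pattern: a heartbeat message arrives and the record is updated."""
    local_time = time.time()
    heartbeats[(from_ip, role)] = (remote_time, local_time)

def detect_controller_failures(timeout, now=None):
    """Pull pattern: scan for servers whose last heartbeat arrived more
    than `timeout` seconds ago (the detect/1 derivation rule)."""
    now = time.time() if now is None else now
    return [(ip, role)
            for (ip, role), (_remote, local) in heartbeats.items()
            if now - local > timeout]
```

In the real system the pull side would be driven by the engine's every('1S') timer; here a caller would simply invoke detect_controller_failures(1.0) once per second.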
    respond(controller_failure(IP,Role,Timeout)) :-
        time(LocalTime),
        first(holdsAt(status(Server,unloaded),LocalTime)),
        update(key(Server),happens(loading(Server),LocalTime)),
        sendMsg(XID,loopback,self,initiate,failover(Role,IP,Server)).

The ECA logic involves possible backtracking so that all failed components will be resurrected. The state of each server is managed via an event calculus formulation:

    initiates(loading(Server),status(Server,loaded),T).
    terminates(unloading(Server),status(Server,loaded),T).
    initiates(unloading(Server),status(Server,unloaded),T).
    terminates(loading(Server),status(Server,unloaded),T).

The actual state of each server is derived from the loading and unloading events that have happened, and is used in the ECA rule to detect the first server in state "unloaded". This EC-based formalisation can easily be extended, e.g. with new states such as a maintenance state, which terminates an unloaded state but is not allowed in case a server is already loaded:

    initiates(maintaining(Server),status(Server,maintenance),T) :-
        not(holdsAt(status(Server,loaded),T)).
    terminates(maintaining(Server),status(Server,unloaded),T).

Due to space restrictions we cannot show further extensions. However, as can already be seen from these initial examples, further higher-level decision logics, such as SLA contract rules defining quality-of-service policies, e.g. average availability levels and penalty payments in case these service levels cannot be met, can easily be built upon this basic set of failover-handling rules using further ECA, EC and event-notification rules.
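To make the event-calculus derivation concrete, here is a small Python sketch (names hypothetical) of how holdsAt can be computed from happened events together with the initiates/terminates facts above, and how the respond rule's first(holdsAt(status(Server,unloaded),T)) lookup works. The precondition guard on maintaining is omitted for brevity.

```python
# Minimal event-calculus sketch: a fluent holds at time T if some
# earlier event initiated it and no later event (before T) terminated it.

# Tables mirroring the initiates/terminates facts in the rules above.
INITIATES = {
    "loading":     lambda s: ("status", s, "loaded"),
    "unloading":   lambda s: ("status", s, "unloaded"),
    "maintaining": lambda s: ("status", s, "maintenance"),
}
TERMINATES = {
    "loading":     lambda s: ("status", s, "unloaded"),
    "unloading":   lambda s: ("status", s, "loaded"),
    "maintaining": lambda s: ("status", s, "unloaded"),
}

# happens(Event(Server), Time) facts, as asserted by the update(...) calls.
happens = []  # list of (event_name, server, time)

def holds_at(fluent, t):
    """Replay events before t in time order; the last initiate/terminate
    affecting the fluent decides whether it holds."""
    holds = False
    for event, server, when in sorted(happens, key=lambda e: e[2]):
        if when >= t:
            break
        if INITIATES[event](server) == fluent:
            holds = True
        elif TERMINATES[event](server) == fluent:
            holds = False
    return holds

def first_unloaded(servers, t):
    """The first(holdsAt(status(Server,unloaded),T)) lookup in respond/1."""
    return next((s for s in servers
                 if holds_at(("status", s, "unloaded"), t)), None)
```

For example, after unloading and then loading a server s1, status(s1,loaded) holds and status(s1,unloaded) no longer does, so a still-unloaded s2 is the first failover candidate.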
Received on Thursday, 8 June 2006 12:13:57 UTC