RE: Issue 179: Faster failover timings for RTP inactivity vs network failover from Robin Raymond on 2015-02-18 (public-ortc@w3.org from February 2015)

From: Robin Raymond <robin@hookflash.com>
Date: Wed, 18 Feb 2015 08:24:49 -0500
To: "public-ortc@w3.org" <public-ortc@w3.org>, Bernard Aboba <bernard.aboba@microsoft.com>
Message-ID: <etPan.54e492a1.49d0feac.191@Robin-iMac.home>
I do not like option (a) [i.e. the programmer has to handle “possible but unconfirmed” nomination failures]. I think the burden on the programmer would be high and the chance of success lower. Option (b) lets the engine handle these cases, allows many optimized scenarios, and is more likely to get interfaces working again.

There is one issue. According to standard ICE rules, the non-nominated candidates are pruned. Thus, under normal ICE, there would be no “backup” ice candidates left to test if the final nominated candidate pair failed after ice nomination completed.

Having no backup ice candidates available due to IceTransport pruning is problematic for this scenario. However, choosing to keep all remote ice candidates as potential backup candidates is also problematic, because most of those ice candidates will have been removed by the IceGatherer due to inactivity [under normal situations] (e.g. all ice candidates but ‘host’, which require no additional resources to keep alive, might now be invalid). The trouble is that there is no way to tell which of the remote ice candidates are still valid, because the remote IceGatherer only indicates when new ice candidates are available and never when ice candidates are removed from the remote IceGatherer.

A simple workaround would be to have the IceGatherer have a “keep candidate warm” timeout as proposed for resolution to issue 174 [“When is an IceGatherer allowed to prune host, reflective and relay candidates?”]. An application could assume that, once that timeout period has passed, all ice candidates passed into the IceGatherer are no longer valid. But that solution only means the backup ice candidates have a longer lifetime window in which to work; ultimately they must be pruned too (unless some assumptions are made about what exactly an IceGatherer will or will not prune, e.g. all non-host candidates are assumed never to be pruned).

The better solution is to not assume anything about what gets pruned by an IceGatherer and leave that up to the IceGatherer’s policy. Instead events could be fired when ice candidates are removed from the IceGatherer. For example, an IceGatherer may choose to keep just host alive, or keep host and reflexive alive (or just certain host and turn alive). By firing events of exactly which ice candidates are removed from the IceGatherer, the remote party can be told exactly which ice candidates remain viable as backup candidates and which ones are to be pruned.

So my proposed solution would be to add:

partial interface IceGatherer {
  attribute EventHandler onlocalcandidateremoved;
};

OR

partial interface IceGatherer {
  // This would allow “getLocalCandidates()” to return the active list available to be paired with a call to “setRemoteCandidates” on the remote side.
  attribute EventHandler onlocalcandidateschanged;
};


This would require trickling of not only new ice candidates but removed ice candidates. The benefit would be having backup ice candidates available for mutual testing by the local and remote IceTransports when the nominated ice candidate pair might be failing (and thus leads to faster failover and greater resilience).
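To illustrate what trickling removals could look like on the receiving side, here is a minimal TypeScript sketch. It is not part of ORTC: the `Candidate` shape, the `TrickleMessage` signaling format, and the `BackupCandidateList` class are all hypothetical names made up for this example. It assumes the application relays both "added" and "removed" notifications from the remote IceGatherer and keeps a list of remote candidates that are still viable as backups.

```typescript
// Hypothetical candidate shape for illustration only (not the ORTC dictionary).
type Candidate = {
  type: "host" | "srflx" | "relay";
  ip: string;
  port: number;
};

// Hypothetical signaling message: the remote side trickles additions AND removals.
type TrickleMessage =
  | { kind: "added"; candidate: Candidate }
  | { kind: "removed"; candidate: Candidate };

// Key used to match a removal against the earlier addition.
const keyOf = (c: Candidate): string => `${c.type}|${c.ip}|${c.port}`;

// Tracks which remote candidates the remote gatherer still keeps alive.
class BackupCandidateList {
  private readonly live = new Map<string, Candidate>();

  // Apply one trickled message from the remote party.
  apply(msg: TrickleMessage): void {
    const key = keyOf(msg.candidate);
    if (msg.kind === "added") {
      this.live.set(key, msg.candidate);
    } else {
      // Removing an unknown candidate is a harmless no-op.
      this.live.delete(key);
    }
  }

  // Candidates that remain usable for backup connectivity checks.
  viable(): Candidate[] {
    return Array.from(this.live.values());
  }
}
```

With something like this, when the nominated pair looks like it is failing, the local side would only test pairs built from `viable()` rather than guessing which remote candidates the remote IceGatherer has already pruned.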

-Robin


On February 14, 2015 at 5:36:06 PM, Bernard Aboba (bernard.aboba@microsoft.com) wrote:

As you say, Option (a) would very likely result in false positives, even if it were only fired after loss of several consent checks. Under Option (b) I would include a variety of future ICE improvements, such as support for multiple nominated pairs (e.g. ICE implementation could switch from a failing candidate pair to a working one prior to consent failure), continuous gathering (e.g. implementation brings up a WWAN interface in response to decreasing signal strength on the WLAN interface on which consent failures are happening), etc. So this one seems the most reasonable.  
________________________________________  
From: Robin Raymond [robin@hookflash.com]  
Sent: Tuesday, February 10, 2015 1:37 PM  
To: public-ortc@w3.org  
Subject: Issue 179: Faster failover timings for RTP inactivity vs network failover  

When RTP data is flowing over an ICE Transport via an RTP sender, it's easy to determine the ICE Transport is still alive / active due to the incoming / outgoing validated SRTP packets. However, a lack of activity is not necessarily an indication of failure, so it can take a significantly long time for the ICE Transport to reach the "disconnected" state and for an application to become aware a problem is occurring.  

Scenario A:  
1) RTP Sender sends packets over ICE Transport  
2) .stop() is called on RTP Sender and ICE Transport sends ICE connectivity checks in absence of RTP data (at a slow interval rate)  

Scenario B:  
1) RTP Sender sends packets over ICE Transport  
2) A transport failure occurs and ICE Transport stops receiving results from connectivity checks and ICE transport eventually goes into "disconnected" state  

Issue:  
The application must wait until the ICE Transport has fully entered the "disconnected" state before an ICE re-gather can be attempted, and only then can the application exchange the new candidates, well after the initial failure actually happened (e.g. this could be 30 seconds later).  

Solutions:  
(a) ICE Transport could signal an issue might be happening well before the full "disconnected" state. This would allow the application layer to warm up the gatherer and get some new candidates well before waiting for a full disconnect to occur.  
(b) ICE Transport could automatically warm up the ICE gatherer when an issue might have been detected, so that the ICE gatherer obtains new (or refreshed) candidates and signals those new candidates in the same manner as normal trickle ICE would reveal them over time, to be signalled to the remote party (thus the ICE Transport would effectively heal itself).  
(c) Do nothing and wait until "disconnected" happens before the application is aware and live with the long time it will take to detect a full failure state.  

Right now, we have (c) by doing nothing. I don't like this excessive timeout. I prefer option (b), where the gatherer is allowed to go out and get new candidates if it detects things are going wrong well before the full disconnect. I think (a) is good in that more control is allowed, but I can see it being difficult for the application developer to get right, especially if the ICE Transport falsely flags a temporary connectivity issue which resolves itself early (and thus the ICE gatherer can be put back to sleep by the ICE Transport straight away).  
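
As a rough illustration of option (b), the TypeScript sketch below shows one way an engine might decide when to warm up the gatherer and when to put it back to sleep. It is only a sketch under stated assumptions: `ConsentMonitor`, the `Action` names, and the two thresholds are hypothetical and are not taken from ORTC or from any ICE specification.

```typescript
// Hypothetical actions an engine might take after each connectivity check.
type Action = "none" | "warm-gatherer" | "sleep-gatherer" | "disconnected";

// Counts consecutive missed connectivity checks and reacts early.
// Both thresholds are assumptions chosen by the caller.
class ConsentMonitor {
  private missed = 0;
  private warmed = false;

  constructor(
    private readonly warmAfter: number,       // misses before warming the gatherer
    private readonly disconnectAfter: number  // misses before declaring "disconnected"
  ) {}

  // Feed the outcome of one connectivity check; returns what to do next.
  onCheckResult(succeeded: boolean): Action {
    if (succeeded) {
      // A temporary problem resolved itself: let the gatherer sleep again.
      const wasWarm = this.warmed;
      this.missed = 0;
      this.warmed = false;
      return wasWarm ? "sleep-gatherer" : "none";
    }
    this.missed += 1;
    if (this.missed >= this.disconnectAfter) {
      return "disconnected";
    }
    if (this.missed >= this.warmAfter && !this.warmed) {
      // Start gathering fresh candidates well before a full disconnect.
      this.warmed = true;
      return "warm-gatherer";
    }
    return "none";
  }
}
```

For example, `new ConsentMonitor(2, 5)` would ask for a gatherer warm-up on the second consecutive missed check and only report "disconnected" on the fifth, so new candidates can already be trickling to the remote party while the transport is still nominally connected.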

This issue is related to issues #174 and #176.  


-Robin
Received on Wednesday, 18 February 2015 13:25:19 UTC