Issue 179: Faster failover timings for RTP inactivity vs network failover from Robin Raymond on 2015-02-10 (public-ortc@w3.org from February 2015)

From: Robin Raymond <robin@hookflash.com>
Date: Tue, 10 Feb 2015 16:37:36 -0500
To: "public-ortc@w3.org" <public-ortc@w3.org>
Message-ID: <etPan.54da7a20.3dc240fb.191@Robins-iMac.local>

When RTP data is flowing over an ICE Transport via an RTP sender, it's easy to determine that the ICE Transport is still alive / active from the incoming / outgoing validated SRTP packets. However, a lack of activity is not necessarily an indication of failure, so the ICE Transport can take a significantly long time to reach the "disconnected" state, and only then does an application become aware that a problem is occurring.

Scenario A:
1) RTP Sender sends packets over ICE Transport
2) .stop() is called on RTP Sender and ICE Transport sends ICE connectivity checks in the absence of RTP data (at a slow interval rate)

Scenario B:
1) RTP Sender sends packets over ICE Transport
2) A transport failure occurs, ICE Transport stops receiving results from connectivity checks, and ICE Transport eventually goes into the "disconnected" state

Issue:
The application must wait until the ICE Transport has fully entered the "disconnected" state before an ICE re-gather can be attempted. Only then can the application exchange the new candidates, which may be long after the initial failure actually happened (e.g. it could be 30 seconds).

Solutions:
(a) ICE Transport could signal that an issue might be happening well before the full "disconnected" state. This would allow the application layer to warm up the gatherer and get some new candidates well before a full disconnect occurs.
(b) ICE Transport could automatically warm up the ICE gatherer when a possible issue is detected. The ICE gatherer would then obtain new (or refreshed) candidates and signal them in the same manner as normal trickle ICE would reveal them over time, so they can be signalled to the remote party (thus the ICE Transport would effectively heal itself).
(c) Do nothing and wait until "disconnected" happens before the application is aware and live with the long time it will take to detect a full failure state.
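
To make the difference between the options concrete, here is a rough sketch of the early-warning idea behind (a) and (b). Nothing in it is ORTC API: the TransportWatchdog class, the "suspect" state, the threshold option names and the event strings are all hypothetical, made up purely for illustration. It only shows a transport noticing inactivity at an early threshold, asking for a gatherer warm-up, and letting the gatherer go back to sleep if activity resumes before the full disconnect threshold.

```typescript
// Hypothetical sketch only: none of these names exist in the ORTC API.
type Health = "active" | "suspect" | "disconnected";

interface WatchdogOptions {
  suspectAfterMs: number;    // assumed early-warning threshold
  disconnectAfterMs: number; // assumed full "disconnected" threshold
}

class TransportWatchdog {
  private lastActivityMs: number;
  private health: Health = "active";
  // Records what the transport would ask of the gatherer / application.
  readonly events: string[] = [];

  constructor(private readonly opts: WatchdogOptions, nowMs: number) {
    this.lastActivityMs = nowMs;
  }

  // Call when a validated SRTP packet or a connectivity check result arrives.
  noteActivity(nowMs: number): void {
    this.lastActivityMs = nowMs;
    if (this.health === "suspect") {
      // False alarm: the gatherer can be put back to sleep straight away.
      this.events.push("sleep-gatherer");
    }
    this.health = "active";
  }

  // Call periodically with the current time; returns the resulting state.
  tick(nowMs: number): Health {
    const idleMs = nowMs - this.lastActivityMs;
    if (idleMs >= this.opts.disconnectAfterMs) {
      if (this.health !== "disconnected") {
        this.events.push("disconnected");
      }
      this.health = "disconnected";
    } else if (idleMs >= this.opts.suspectAfterMs && this.health === "active") {
      // Early warning: start gathering new candidates before a full disconnect.
      this.events.push("warm-up-gatherer");
      this.health = "suspect";
    }
    return this.health;
  }
}
```

With (a), the "warm-up-gatherer" event would be surfaced to the application, which decides what to do; with (b), the ICE Transport would act on it internally. The time is passed in explicitly rather than read from a timer only so the sketch is easy to step through by hand; the threshold values themselves are left to the caller because I'm not proposing specific numbers here.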

Right now, we have (c) by doing nothing. I don't like this excessive timeout. I prefer option (b), where the gatherer is allowed to go out and get new candidates if it detects things are going wrong well before the full disconnect. I think (a) is good in that it allows more control, but I can see it being difficult for the application developer to get right, especially if the ICE Transport falsely flags a temporary connectivity issue which resolves itself early (and thus the ICE gatherer can be put back to sleep by the ICE Transport straight away).

This issue is related to issues #174 and #176.


-Robin

Received on Tuesday, 10 February 2015 21:38:07 UTC