ICE mobile Q&A / gathering / restart issues on mobile

I saw this good draft proposal from Justin:
http://juberti.github.io/draughts/nombis/draft-uberti-mmusic-nombis.html


In relation to that draft, I wanted to explain and address issues with ICE gathering and mobility as they apply to ORTC, so I wrote up the following explanation of some of the issues and concerns:

ICE Mobile Q & A

Why are traditional ICE restarts bad for mobile?

Interfaces on mobile go up and down like yo-yos. Wifi up. Wifi down. 3/4g up. 3/4g down. Requiring a full restart on mobile after "end of candidates" whenever an interface goes up or down is not a good idea. Waiting for ICE "disconnect" before doing a full restart is also a bad idea, as it should be possible to quickly test unused candidates as a backup long before a full communication breakdown.

Within ORTC, to do an ICE restart as defined in RFC 5245 requires the construction of a new IceGatherer object (passed as an argument to IceTransport.start()) so as to change the ICE username and password. Where a single offer goes out with multiple device replies (forking), the IceTransport(s) corresponding to the forks typically use the same IceGatherer, so as to enable reusing the host, reflexive and relay candidate allocations.  However, if a new IceGatherer object is required each time one of the IceTransports restarts, then each of the forks will end up with their own IceGatherer object, leading to wasted host, reflexive and relay candidate allocations. There is no need to allocate more than one set of host, reflexive (firewall pinholes), TURN ports (i.e. one IceGatherer) to support communication with multiple devices.
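To make the allocation cost concrete, here is a minimal sketch. It does not use the real ORTC API: MockGatherer, MockTransport and the assumption of exactly three allocations (host, server reflexive, relay) per gatherer are hypothetical stand-ins chosen only to count allocations.

```typescript
// Hypothetical model, NOT the ORTC API. Each gatherer is assumed to own
// one host, one server reflexive and one relay allocation.
type CandidateType = "host" | "srflx" | "relay";

class MockGatherer {
  static allocations = 0; // total allocations made by all gatherers so far
  readonly candidates: CandidateType[] = ["host", "srflx", "relay"];
  constructor() {
    MockGatherer.allocations += this.candidates.length;
  }
}

class MockTransport {
  constructor(public gatherer: MockGatherer) {}
  // A traditional restart needs a new ufrag/password, which in this
  // model means constructing a new gatherer for this transport.
  restartWithNewGatherer(): void {
    this.gatherer = new MockGatherer();
  }
}

// Count allocations for `forks` transports that start on one shared
// gatherer and then, optionally, each perform a traditional restart.
function allocationsAfterRestarts(forks: number, restartEachFork: boolean): number {
  MockGatherer.allocations = 0;
  const shared = new MockGatherer();
  const transports: MockTransport[] = [];
  for (let i = 0; i < forks; i++) {
    transports.push(new MockTransport(shared));
  }
  if (restartEachFork) {
    transports.forEach((t) => t.restartWithNewGatherer());
  }
  return MockGatherer.allocations;
}
```

Under these assumptions, three forks sharing one gatherer cost 3 allocations, while three forks that each restart with their own gatherer cost 12.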

Why are non-traditional but full regather and re-test restarts bad on mobile?

To avoid wasting port allocations, the same IceGatherer could be reused in a non-traditional restart, but the IceGatherer could be forced to regather candidates from scratch when interfaces go up or down.

Given the potential for frequent changes in the state of wifi/3g/4g interfaces, it is highly undesirable to require a full restart every time interfaces go up/down (particularly when taking IPv6 into account). Among other things, this can require constantly powering up cellular data or WiFi interfaces in order to re-allocate server reflexive and relay candidates. This is not acceptable. The only time it should be necessary to power up unused wireless interfaces is when the IceTransport state becomes “disconnected” and there is no alternative but to restart.

What about the requirement to support traditional restart for SIP (with rollback)?

ORTC supports this already. Construct new IceGatherer and IceTransport objects and call IceTransport.start().  Since the original IceGatherer and IceTransport objects remain in place, in the event “rollback” is required the original state remains available.  


Why is purging all unused remote candidates / candidate pairs bad for mobile?

3g/4g and wifi host candidates may remain available even if TURN allocations are shut down due to inactivity.  For example, with IPv6 stateless address allocation (or even with IPv6 privacy addresses) host candidates can persist for some time even if the interface has been put to sleep in order to save energy.  When the interface re-awakens, reusing the IPv6 host candidate may only require conflict detection and restarting of connectivity checks. In such a situation, being able to keep remote IPv6 host candidates available for quick retesting is a big win (and the retest is likely to succeed).  In contrast, traditional ICE would require that unused candidates be purged upon candidate nomination.

Why does the IceGatherer need to be able to re-gather on request?

When the IceTransport enters the “disconnected” state, the nominated pair no longer has connectivity so that media will no longer be sent/received.  To enable continuity, it is necessary to re-gather candidates.  This can either be accomplished by constructing a new IceGatherer, or by allowing an existing IceGatherer to regather reflexive and relay candidates so that these can be provided to the remote peer.  

After an initial offer is forked to a set of devices, and the multiple IceTransport objects created to handle each fork have completed their connectivity checks, nominated candidate pairs and pruned unused candidates, a new response fork could arrive so that there is a need to re-gather candidates.  Rather than changing the ufrag/password which can only be achieved by constructing a new IceGatherer (and yet another Offer/Answer exchange), it is desirable to reuse an existing IceGatherer, so as to avoid duplicating host, server reflexive and relay allocations. In addition, the code can be far simpler if only one IceGatherer is used for all forks.  

Why is continuous gathering needed?

Mobile devices see wireless wifi/3g/4g interfaces go up and down constantly.  Relaying new candidates as interfaces come up makes it more likely that connectivity can remain established with associated fallback attempts rather than continually transitioning into the “disconnected” state (where media is broken).  

Why does the IceGatherer need end-of-candidates for a continuous gathering mode?

Without an "end of candidates" indication, there needs to be another way for the application to determine whether new ICE candidates are expected in the near future. Alternatives include a magic timeout or a change in the IceGatherer state to “idle”, indicating that ICE candidate gathering has completed for the moment, but that candidates have not yet been pruned (thus the IceGatherer is not in a “sleep” state).  

Why are there multiple "end of candidates" events in continuous gathering mode?

If the IceGatherer can be restarted at any time (such as by adding an IceGatherer.gather() method) then it no longer makes sense to have “end of candidates” emitted from the IceGatherer.  Instead, an event can be provided when the IceGatherer.state changes, so as to enable the application to know when each successive round of ICE candidate gathering has completed.  

Why is aggressive nomination bad for continuous gathering?

Aggressive nomination requires the highest priority candidate pair be automatically selected by both parties by having each candidate pair include the “use candidate” flag. This does not allow fallback to a lower priority candidate pair to occur such as when switching from wifi to 3g/4g. A better solution is to allow any successfully tested candidate pair to be used while preferring the nominated pair selected by the controlling peer. This allows the controlling peer to switch from higher to lower priority candidate pairs as needed.
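The selection rule described above can be sketched as follows. This is only an illustration under assumed names: CandidatePair, its fields and selectPair are hypothetical and not part of ORTC or RFC 5245.

```typescript
// Hypothetical types for illustration only.
interface CandidatePair {
  id: string;
  priority: number;   // higher is preferred
  succeeded: boolean; // a connectivity check on this pair currently passes
  nominated: boolean; // selected by the controlling peer
}

// Prefer the nominated pair while it still works; otherwise fall back to
// the highest priority pair that has passed its connectivity check.
function selectPair(pairs: CandidatePair[]): CandidatePair | undefined {
  const usable = pairs.filter((p) => p.succeeded);
  let best: CandidatePair | undefined = undefined;
  for (const p of usable) {
    if (p.nominated) return p;
    if (best === undefined || p.priority > best.priority) best = p;
  }
  return best;
}
```

For example, if the nominated wifi pair stops passing checks while a lower priority 3g/4g pair still succeeds, selectPair returns the 3g/4g pair instead of leaving the transport without a usable path.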

Why is keeping TURN candidates alive a bad idea for mobile?

Keeping relay candidates alive requires sending periodic packets. This in turn prevents 3g/4g interfaces from going into power management states so as to reduce power consumption.  This is not good policy on a mobile device.  Continually testing non-nominated pairs in order to keep remote relay candidates alive has a similar problem, since in order to respond to the incoming checks the 3g/4g interfaces may need to be woken up.

Suggested Changes to ORTC

Add a "continuous gathering" mode to the "ice gatherer options" where aggressive nomination will NOT be used by default.

Add an IceGatherer state + event as the mechanism to determine “end of candidates" when in continuous mode (rather than repeatedly firing an "end of candidates" event). The "end of candidates" event does not need to fire in "continuous gathering mode" as that condition can be determined based upon IceGatherer.state transitioning from "gathering" to "idle".
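A rough sketch of what the state + event mechanism might look like is given below. SketchGatherer, the "new" state, the callback names and the synchronous discover function are all assumptions made for illustration; this is not the ORTC IceGatherer interface, and real gathering would be asynchronous.

```typescript
// Hypothetical states: "gathering", "idle" and "sleep" come from the
// proposal text, "new" is an assumed initial state.
type GathererState = "new" | "gathering" | "idle" | "sleep";

class SketchGatherer {
  state: GathererState = "new";
  onstatechange: (state: GathererState) => void = () => {};
  onlocalcandidate: (candidate: string) => void = () => {};

  // `discover` stands in for real interface, STUN and TURN discovery.
  constructor(private discover: () => string[]) {}

  private setState(next: GathererState): void {
    if (next === this.state) return;
    this.state = next;
    this.onstatechange(next);
  }

  // May be called repeatedly; each call is one round of gathering.
  gather(): void {
    this.setState("gathering");
    for (const c of this.discover()) {
      this.onlocalcandidate(c);
    }
    // The "gathering" -> "idle" transition takes the place of a
    // repeated "end of candidates" event.
    this.setState("idle");
  }

  sleep(): void {
    this.setState("sleep");
  }
}
```

An application would treat each "gathering" to "idle" transition as the end of one round, and could call gather() again when an interface comes up.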

If more than one nominated candidate pair is allowed (as is proposed for multipath RTP), then nominated pairs can be kept alive via candidate pair checks.  However, maintaining ongoing connectivity checks is expensive, so that maintaining multiple nominated candidate pairs may not necessarily be desirable where multipath RTP is not in use.  

As an alternative, it may be useful to maintain host, server reflexive and relay candidates for a pre-determined amount of time that is communicated to the remote peer, rather than pruning all non-nominated candidates.  This enables a peer obtaining a new candidate to begin testing connectivity with a set of “known good” remote candidates.  Potential approaches include:  
Option a - Individual candidate "lifetime" parameters.  This approach requires considerable additional overhead in signaling.  Since host candidates may not be available “forever” (since wifi/3g/4g interfaces can be sleeping), the associated server reflexive and relay candidates each also have an associated “lifetime”.  
Option b - All candidates have a collective "lifetime". This lifetime is arbitrary and presumed. E.g. the IceGatherer keeps candidates alive on a “best effort” basis (e.g. for "3 minutes") after IceGatherer.gather() is called. This provides a finite window for candidates to be selected for use by the IceGatherer. After the collective “lifetime” has expired, the IceGatherer transitions to the “sleep” state where it will respond to connectivity checks but shut down any unused relay candidates while keeping host candidates alive for future connectivity checks (and thus fallback).
Option c - Same as "b" but the developer could tell the IceGatherer to keep the candidates alive on a best-effort basis for a non-arbitrary time like "10 minutes". This gives a 10 minute window for these candidates to be used before the IceGatherer prunes them for non-use.
Option d - Information on added and deleted candidates is trickled as it is learned on the local peer.  This tells the remote party exactly which candidates are available and which are not.
Option e - Combine option b+d or c+d. Candidates are kept alive on a best effort basis for a period of time and candidates are added/deleted as they go up/down.
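As an illustration of option e, the sketch below combines a collective lifetime with explicit add/remove of candidates. CandidatePool and all of its members are hypothetical names, time is passed in explicitly as milliseconds to keep the example deterministic, and keeping in-use candidates after expiry is an assumption about how "unused" would be decided.

```typescript
// Hypothetical model for illustration; not part of ORTC.
type Kind = "host" | "srflx" | "relay";

interface LocalCandidate {
  id: string;
  kind: Kind;
}

class CandidatePool {
  private candidates: LocalCandidate[] = [];
  private expiresAt = 0;

  // lifetimeMs is the collective lifetime, e.g. 3 or 10 minutes.
  constructor(private lifetimeMs: number) {}

  // Called when a gathering round starts: restarts the collective lifetime.
  startRound(nowMs: number): void {
    this.expiresAt = nowMs + this.lifetimeMs;
  }

  // Option d: candidates are added and removed as interfaces go up/down.
  add(c: LocalCandidate): void {
    this.candidates.push(c);
  }

  remove(id: string): void {
    this.candidates = this.candidates.filter((c) => c.id !== id);
  }

  // Candidates still offered to the remote peer at time nowMs. After the
  // lifetime expires only host candidates and in-use candidates remain.
  available(nowMs: number, inUseIds: string[]): LocalCandidate[] {
    const expired = nowMs >= this.expiresAt;
    return this.candidates.filter(
      (c) => !expired || c.kind === "host" || inUseIds.indexOf(c.id) !== -1
    );
  }
}
```

With a 3 minute lifetime, every candidate is offered during the window; once it expires, unused server reflexive and relay candidates drop out while host candidates stay available for fallback.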

Allow the IceGatherer to re-gather via addition of an IceGatherer.gather() method.  

Add a “sleep” state to the IceGatherer.  When the IceGatherer transitions to the “sleep” state, it means that the IceGatherer has pruned unused ICE candidates (especially those requiring remote allocations such as server reflexive or relay candidates) but still initiates connectivity checks on nominated candidate pairs and responds to incoming connectivity checks.  Host candidates that do not cost much to be kept alive can be retained for fallback scenarios, even if they are not used in nominated candidate pairs.  

Received on Wednesday, 10 December 2014 20:34:36 UTC