Re: RTCDataChannel characteristics and failures -API description - from Gunnar Hellstrom on 2014-09-07 (public-webrtc@w3.org from September 2014)

From: Gunnar Hellstrom <gunnar.hellstrom@omnitor.se>
Date: Sun, 07 Sep 2014 17:30:36 +0200
To: Michael Tuexen <Michael.Tuexen@lurchi.franken.de>
CC: public-webrtc@w3.org
Message-ID: <540C7A1C.9050301@omnitor.se>

Michael,
Thanks for the clarifications and corrections.

I think it would be good to have an informational section in the WebRTC 
API document, giving a very brief summary of these characteristics with 
a suitable caveat that they are valid for SCTP transport and normal 
network round-trip times.

The fact that the ICE liveness check will usually fail after 30 seconds of 
a very bad connection, and usually close the channels, should also be 
mentioned there.

This would be a good basis for application developers deciding whether to 
use reliable or unreliable mode, and how to set the number of retries or the 
maximum transmission time for unreliable mode.
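
As a rough illustration of that decision, the sketch below counts how many retransmissions fit inside a given time budget. It assumes the retransmission timeout starts at 1 second, doubles after each loss, and is capped at 60 seconds, which are the values implied by the schedules quoted below; a real stack adapts the timeout to the measured round-trip time. The function name and parameters are hypothetical and not part of any API.

```python
def retransmits_within(budget_s, rto_initial=1.0, rto_max=60.0):
    """Count retransmissions that start within budget_s seconds of the first send.

    Assumes a path that never answers, with the timeout doubling after every
    loss and capped at rto_max (assumed values, see the text above).
    """
    count = 0
    elapsed = 0.0
    rto = rto_initial
    while elapsed + rto <= budget_s:
        elapsed += rto               # the timer expires, a retransmission goes out
        count += 1
        rto = min(rto * 2, rto_max)  # exponential backoff with a ceiling
    return count


if __name__ == "__main__":
    for budget in (1, 10, 63):
        print(budget, "s budget ->", retransmits_within(budget), "retransmissions")
```

Under these assumptions a 10 second limit leaves room for only three retransmissions (at 1, 3 and 7 seconds), so a retry count much above that would never be reached on a dead path with such a time limit.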

Gunnar
----------------------------

On 2014-09-06 16:51, Michael Tuexen wrote:
> On 06 Sep 2014, at 08:45, Gunnar Hellstrom <gunnar.hellstrom@omnitor.se> wrote:
>
>> On 2014-09-05 23:47, Martin Thomson wrote:
>>> On 5 September 2014 11:14, Gunnar Hellstrom <gunnar.hellstrom@omnitor.se> wrote:
>>>> So, you expect that in most cases the retries for a reliable channel will
>>>> spread over 30 seconds, and if still unsuccessful, the Association and all
>>>> its channels will be aborted.
>>> No, my point was that the association timers run longer than the
>>> timers that govern liveness of a single path.  That means that an
>>> association can survive a path failure if an alternative path is
>>> found, and it seems like we have ample time to do that.  Some
>>> implementations already attempt that process, though I can't speak to
>>> the efficacy.
>>>
>>>> Can you explain how you got that 30 seconds figure?
>>> https://tools.ietf.org/html/draft-ietf-rtcweb-stun-consent-freshness
>>>
>> Thanks.
>> So, the normal case with default values for an SCTP reliable channel seems to be:
>>
>> 1. SCTP retransmissions.
>> The retransmissions will be made at 1, 3, 7, 15, 31, 63, 127, ... seconds after the initial transmission.
> Correct.
>> 2. SCTP heartbeats
>> The SCTP heartbeats will be sent asynchronously with the data transmission, at 30-second intervals. Let us assume that they will be sent at 12, 42, 72, 102 seconds after the data transmission.
> Incorrect:
> 1. You are making the assumption that HEARTBEATs are sent while you have outstanding data
>     on that path. This is true for the FreeBSD implementation, but this is implementation
>     specific and not the best choice. It will be changed such that you don't send HEARTBEATs
>     when you have outstanding data.
> 2. HEARTBEATs are sent every HB.Interval + RTO. So it would be 12, 43, 75, 109, and so on
>     if there is no connectivity anymore.
> 3. If quick failover is used, the HEARTBEATs would not take HB.Interval into account if
>     the path is potentially failed. This makes failure detection for idle paths similar
>     to the failure detection for non-idle paths. Something good, I think.
>> 3. SCTP Max.Retrans
>> We have a maximum of 10 retransmissions for the WebRTC use of SCTP. This count includes both failed data transmissions and failed heartbeats.
>> Thus the data transmission would be considered failed after 127 seconds, with 7 data retransmissions and 3 missed heartbeats. (My earlier calculation of 500 seconds did not take the heartbeats into consideration.)
>> When this happens, the association breaks and all data channels within it are closed.
> Assuming that you don't send HEARTBEATs when you have outstanding data, you get (if I compute correctly):
>
> Case 1: Non-idle path
> Unanswered transmissions at 0, 1, 3, 7, 15, 31, 63, 123, 183, 243, 303 and the association will fail after 363 seconds.
>
> Case 2a: Idle path, no quick failover
> Unanswered HEARTBEAT transmissions at 0, 31, 63, 97, 135, 181, 243, 333, 423, 513, 603 and the association will fail after 693 seconds.
>
> Case 2b: Idle path, quick failover
> Unanswered HEARTBEAT transmissions at 0, 31, 33, 37, 45, 61, 93, 153, 213, 273, 333 and the association will fail after 393 seconds.
>
> As you see, when using quick failover, the difference in the detection time of communication loss between an idle path
> and a non-idle path is one HB.Interval. If quick failover is not used, the difference is (Max.assoc.retrans + 1) * HB.Interval.
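
The three cases above can be reproduced with a short sketch. It assumes an initial RTO of 1 second that doubles after each unanswered transmission and is capped at 60 seconds, an HB.Interval of 30 seconds, and 11 transmissions (the original plus 10 retransmissions) before the association fails. These values are inferred from the numbers quoted above; the function and parameter names are hypothetical.

```python
def send_times(attempts=11, rto_initial=1, rto_max=60,
               hb_interval=0, quick_failover=False):
    """Return (send_times, failure_time) in seconds for a path that never answers.

    hb_interval=0 models a non-idle path (data retransmissions only).
    hb_interval>0 models HEARTBEATs on an idle path; with quick_failover the
    interval is added only before the first repeat, afterwards the RTO alone
    spaces the HEARTBEATs.
    """
    times = []
    now = 0
    rto = rto_initial
    for i in range(attempts):
        times.append(now)
        gap = rto
        if hb_interval and (i == 0 or not quick_failover):
            gap += hb_interval
        now += gap
        rto = min(rto * 2, rto_max)  # exponential backoff, capped at rto_max
    return times, now


if __name__ == "__main__":
    print("Case 1 :", send_times())
    print("Case 2a:", send_times(hb_interval=30))
    print("Case 2b:", send_times(hb_interval=30, quick_failover=True))
```

With these defaults the failure times come out as 363, 693 and 393 seconds, so the gap between idle and non-idle detection is 30 seconds with quick failover and 11 * 30 = 330 seconds without it.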
>
> Quick Failover is specified in
> http://tools.ietf.org/html/draft-ietf-tsvwg-sctp-failover-05
>
> Best regards
> Michael
>> 4. ICE connectivity
>> In parallel with that, the ICE consent connectivity check is performed with a mean interval of 5 seconds, and if all checks fail within 30 seconds, the connection shall close. So that is 6 consecutive ICE consent transmissions that would have failed.
>>
>> It is quite likely that in this case the network is so bad that the ICE liveness test will fail before the combined retransmission and heartbeat failure. You say that an ICE connectivity failure does not directly cause an association abort. It will depend on the implementation whether the path is refreshed. Would the remaining RTCDataChannel retransmissions and heartbeats then be sent over that new path? Won't the normal behavior for the RTCDataChannel be to close all channels on that connection when the ICE liveness check fails?
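
To make the ordering concrete, here is a minimal sketch of what has happened on the SCTP side by the time ICE consent expires. It assumes that the outage starts at time 0 for both mechanisms, that consent checks go out at a fixed 5 second interval rather than a randomized one, and that the SCTP timeout starts at 1 second, doubles after each loss and is capped at 60 seconds. All names are hypothetical.

```python
def events_before_consent_expiry(consent_timeout=30, consent_interval=5,
                                 rto_initial=1, rto_max=60):
    """Return (failed_consent_checks, sctp_retransmit_times) up to consent expiry.

    Both mechanisms are assumed to see the outage begin at t = 0 seconds.
    """
    consent_checks = consent_timeout // consent_interval
    retransmits = []
    t, rto = 0, rto_initial
    while t + rto <= consent_timeout:
        t += rto                     # retransmission timer expires
        retransmits.append(t)
        rto = min(rto * 2, rto_max)  # exponential backoff, capped
    return consent_checks, retransmits


if __name__ == "__main__":
    checks, retx = events_before_consent_expiry()
    print(checks, "failed consent checks; SCTP retransmissions at", retx)
```

Under these assumptions, six consent checks have failed and only four SCTP retransmissions (at 1, 3, 7 and 15 seconds) have been attempted when consent expires, well short of the 10 needed to fail the association.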
>>
>>
>> I think that this information is adding up to something that would be of value to insert as an informational box in the API specification, with the title: Example of retransmissions and connectivity checks for RTCDataChannel.
>>
>>
>> /Gunnar
>>
>>
Received on Sunday, 7 September 2014 15:31:13 UTC