Re: Alternative algorithm for timeouts from Michiel de Jong on 2017-07-20 (public-interledger@w3.org from July 2017)

From: Michiel de Jong <michiel@unhosted.org>
Date: Thu, 20 Jul 2017 10:11:13 +0200
To: Enrique Arizon Benito <enrique.arizon.benito@everis.com>
Cc: Interledger Community Group <public-interledger@w3.org>
Message-ID: <CA+aD3u2YzFSS7GPgqcN=JEjb7AAJJY9E0TY+H4kKgg4sfjX8GA@mail.gmail.com>

Hi Enrique,

One way to implement something more flexible would be to set a timeout of
24 hours on your payment, but "urge" connectors to be much much quicker
than that during normal operation. Most payments would only take 2.1
seconds in your example. If an error occurred due to: no route found,
insufficient liquidity, insufficient source amount, receiver account
doesn't exist, or receiver is not listening, or receiver does not know how
to fulfill the condition, you will get an ILP error back, also in roughly
2.1 seconds. *Only* in the event that the comms network fails, would
payments be tied up for up to 24 hours.

An argument for why 24 hour timeouts could be OK is that if you do business
with a connector, and you have money locked up in a one-to-one payment
channel to that connector, then if it's having network trouble, you can't
reuse that balance for other purposes anyway (unless you close the channel
one-sidedly). This also goes if you peer over a trustline. It doesn't hold
when you peer over a FiveBells ledger, and some of your other peers also
have an account on that same ledger, because then you would want a quick
rollback of the money into your FiveBells ledger account, so that you can
reuse that balance to route payments to other peers who are on the same
ledger.

However, even though 24 hour timeouts could work in some circumstances,
there is a network effect; if one connector uses long timeouts then that's
useless for later connectors in the chain of transfer (they will still be
in a hurry), unless they do too. Also, a connector that uses long timeouts
will not even be invited to be part of a chain of transfers if there is at
least one connector earlier in the chain, that is in more of a hurry.
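
A minimal sketch of this network effect, under simplifying assumptions that are mine and not part of any ILP implementation: each connector is described only by the longest hold time it tolerates and the shortest hold time it requires, and per-hop message windows are ignored. The function and parameter names are hypothetical.

```python
def usable_hold_times(sender_timeout, max_hold_times):
    """Hold time (seconds) each connector on a path can actually rely on.

    Assumption: a connector can never be given more time than the most
    hurried party before it on the path is willing to wait.
    """
    budgets = []
    remaining = sender_timeout
    for max_hold in max_hold_times:
        remaining = min(remaining, max_hold)
        budgets.append(remaining)
    return budgets


def can_join(budget_offered, min_required):
    """A connector only fits on the path if it is offered at least the
    minimum hold time it requires for itself (hypothetical rule)."""
    return budget_offered >= min_required


DAY = 24 * 60 * 60
# The sender and connectors 1 and 3 tolerate 24 hours; connector 2 only 100 s.
budgets = usable_hold_times(DAY, [DAY, 100, DAY])
print(budgets)
# A connector that insists on an hour cannot be placed after connector 2.
print(can_join(budgets[1], 3600))
```

In this toy model the patient third connector gains nothing from its own long timeout, because the hurried second connector caps what everyone after it can be offered.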

Each connector will have a minimum timeout which it requires for itself, and
a maximum timeout it allows other connectors to have. Due to this network
effect, it's useful if we all use roughly the same minimum and maximum
values (at least within an order of magnitude). For instance, the current
default settings for ilp-kit are:
* the receiver is allowed EXPIRY_DURATION = 10 seconds to fulfill
* each connector will add MIN_MESSAGE_WINDOW = 1 second (500ms in each
direction IIUC)
* each connector will accept a total of up to
CONNECTOR_MAX_HOLD_TIME=100 seconds

A network with identical nodes like that will in theory be able to handle
up to 90-hop payments. But if one of the connectors (or say, the receiving
ledger in your example) would set MIN_MESSAGE_WINDOW or EXPIRY_DURATION to
3600 seconds, then that would be incompatible with the
CONNECTOR_MAX_HOLD_TIME of other connectors in the network, and that would
make the network less efficient in forwarding payments.
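
As a rough worked example of where the 90-hop figure can come from (my reading of the numbers above, not code taken from ilp-kit): the receiver's EXPIRY_DURATION is reserved first, and each connector then consumes one MIN_MESSAGE_WINDOW of the remaining CONNECTOR_MAX_HOLD_TIME. The function name `max_hops` is hypothetical.

```python
# Default values quoted above, in seconds.
EXPIRY_DURATION = 10
MIN_MESSAGE_WINDOW = 1
CONNECTOR_MAX_HOLD_TIME = 100


def max_hops(max_hold_time, expiry_duration, min_message_window):
    """Number of connectors that fit into the hold-time budget.

    Assumes identical nodes and that every connector adds exactly one
    message window on top of the receiver's expiry duration.
    """
    if expiry_duration > max_hold_time:
        return 0  # even a direct payment exceeds what a connector accepts
    return (max_hold_time - expiry_duration) // min_message_window


print(max_hops(CONNECTOR_MAX_HOLD_TIME, EXPIRY_DURATION, MIN_MESSAGE_WINDOW))
# One node asking for 3600 seconds no longer fits into anyone's budget:
print(max_hops(CONNECTOR_MAX_HOLD_TIME, 3600, MIN_MESSAGE_WINDOW))
print(max_hops(CONNECTOR_MAX_HOLD_TIME, EXPIRY_DURATION, 3600))
```

With the defaults this gives (100 - 10) / 1 = 90 hops; with either value set to 3600 seconds the result drops to 0, which is the incompatibility described above.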

So that's why we say it's recommended for everybody to stay near the
tens-of-seconds order of magnitude with their minimum and maximum timeout
values.

Cheers!
Michiel.

On Thu, Jul 20, 2017 at 9:12 AM, Enrique Arizon Benito <
enrique.arizon.benito@everis.com> wrote:

> In particular I was thinking about a network outage/slow-down in the
> specific case when such outage/slow-down is caused by the receiving ledger
> that fulfills the transfer (versus intermediate ledgers that just propagate
> the transfer back-and-forth), between the time the transfer arrives and
> the time the fulfillment triggers the execution of the payment.
>
>  This is the most risky scenario, because the transfer could be fulfilled
> from the point of view of the receiving ledger, and unfulfilled from the
> originating one.
>
> I'm thinking the previous timeout algorithm can be even simpler:
>
> Let's assume the following conditions:
>  - "delivery in time" is defined to be a maximum of 10 seconds
>  - In normal circumstances it takes 0.1 secs for the receiving ledger to
> fulfill an incoming transfer.
>  - It takes 1 second for the transfer to "arrive" at ledger2.
>  - It takes another 1 second for the fulfillment to propagate back.
>
> That means that in normal circumstances it takes a total of 1 + 0.1 + 1 =
> 2.1 seconds for the fulfillment to arrive, so the "delivery in time" is
> guaranteed.
>
>  It could be the case that in abnormal circumstances ledger2 is overloaded
> (or in the blockchain case, that the fees are greater than expected) and it
> takes 9 seconds to fulfill the transfer instead of 0.1 secs as usual. In
> such a case the total time will be 1 + 9 + 1 = 11 secs. Since this scenario
> will be unusual, it could be perfectly possible for ledger1 to wait
> slightly longer to avoid the race-condition (fulfilled in ledger2,
> unfulfilled in ledger1).
>
> Now the key point: If ledger2 takes more than the expected "delivery in
> time"  ***from the time a transfer arrives at the receiving ledger to the
> time the fulfillment is created***, then it ignores the fulfillment with a
> timeout and rolls back.  In such a case an ILPError will be transmitted back
> to the originating ledger1 and eventually it will also roll back. There is
> no discrepancy between ledgers and connectors.
>
>  This mode of operation (mostly) eliminates the race-condition but raises
> a usability problem when the transfer fails, from the point of view of the
> ledger1 user making the payment. That user will not lose any money, and
> neither will a connector in the middle, but it will take more than
> "delivery in time" to be notified of the failure. I think this is mostly
> acceptable. Maybe it could be as simple as providing feedback with a popup:
> "The payment is taking an abnormal amount of time to be confirmed, please
> wait".
>
>  A bigger problem is that ledger1 does not know what "waiting slightly
> more" really means, but different solutions exist. Some simple alternative
> algorithms:
> - An ILPError arrives from ledger2 indicating the timeout.
>   This is the simplest case, and will never cause a race-condition, but it
> could be the case that ledger2 is completely blocked and cannot even send
> an ILPError back.
>
> Then some alternative predictive (predictive == risk of "guessing the
> future") algorithms can be used:
>
> - If transferN has not arrived after creation + fixed timeout, cancel.
> (This is the current algorithm).
>
>  - If transferN has not arrived in time, wait to see if transferN+1,
> transferN+2 arrive and then finally cancel transferN. It fails if the
> traffic is very low (there is no transferN+1), but actually if transferN+1
> fulfillment arrives and after a timeout transferN fulfillment has not,
> ledger1 can be pretty sure that the transfer failed.
>
> - If the last 100 transfers to ledger2 arrived in time, add some extra
> time-window, since ledger1 will have good reasons to trust this ledger.
>
> - ....
>
>   Once we assume that the sending ledger will use predictive algorithms to
> determine the timeout, the possibility of random race-conditions appears,
> so those predictive algorithms must be used just for the worst-case
> scenario (receiving ledger non-recoverable failure just after fulfillment
> and local execution and no error arriving at the originating ledger).
>
> Summarizing, IMHO, the ledger1 timeout calculation used right now is quite
> arbitrary and increases the risk for intermediate connectors in case of
> receiving ledger overload, with minor benefits for final users.
>
> Regards,
>
> Enrique
>
> ------------------------------
> *From:* Michiel de Jong [michiel@unhosted.org]
> *Sent:* Wednesday, 19 July 2017 14:08
> *To:* Enrique Arizon Benito
> *CC:* David Fuelling; public-interledger@w3.org
>
> *Subject:* Re: Alternative algorithm for timeouts
>
> Hi Enrique,
>
> I think I understand now what you're getting at; basically, if there is a
> network outage, wait a bit longer before you time out the transaction.
> Right? Your point reminds me of
> https://github.com/interledger/rfcs/issues/159, and I guess it's one of
> the more confusing things when you first start to think about how the
> Interledger protocol
> works. The hard timeouts used by Interledger require connectors to take a
> fulfillment risk: they have to set a minimal message window, and if they
> set this value too low, it might be that in practice they are too slow
> (because of server load, or network downtime), and they lose money.
>
> There is a reason though for the hard timeouts: it provides a predictable
> end-to-end experience: delivery on time, or your money back. The fact that
> connectors give you a sort of SLA and take care of the timeout risk, means
> that you as a sender don't have to, and you can count on a hard guarantee.
>
> You can't tell the next connector "it's ok if you're a bit late", because
> that only works if your own previous connector is also so kind to you, and
> so that's just moving the risk from one connector to the previous one.
>
> You could also set a very high timeout, which only ever gets reached if
> the comms links are really down. But that will only benefit the first few
> connectors on the path, only up to the point that their previous connector
> is willing to keep money on hold for them.
>
> You could imagine a protocol (different from ILP) where the "delivery on
> time, or your money back" is not a design goal. If the comms network is
> usually quite reliable, payment senders/receivers are usually not in a
> hurry (for instance, the network is designed for settling debt loops, which
> is not a while-you-wait use case) and pretending to be down is not a way in
> which neighboring connectors try to steal from each other (for instance,
> because the inflight amount they could steal is much lower than their
> ongoing business opportunity if they don't steal it), you could allow link
> downtime in the comms network to cause longer delays at the end-to-end
> level. You could then even, for instance, not specify a hard wallclock
> timeout time at all up front, but just send a "this is taking too long,
> please roll back" message to trigger the rollback of a payment that's
> taking too long.
>
> But in the case of ILP, I think there's no way around it, the sender gets
> a guarantee, and all connectors along the path have to take a fulfillment
> risk, in order to be able to provide this guarantee. If the sender sets a
> short timeout, then the connector has to set a timeout for the next hop
> that's even shorter.
>
>
> Cheers,
> Michiel.
>
> On Wed, Jul 19, 2017 at 10:02 AM, Enrique Arizon Benito <
> enrique.arizon.benito@everis.com> wrote:
>
>>  I'll try to make it more visual.
>>
>>  Let's suppose senders initiating the transfers are on ledger1 and
>> receivers are on ledger2.
>>
>>  Suppose ledger1 "randomly" initiates the following transactions, ordered
>> according to its *local* time.
>>
>>  In this diagram the dot '.' represents a unit of time of 1/15/... secs:
>>
>>   timeline tx created@ledger1:
>>     |.........|.........|.........|.........|....
>>      ^     ^    ^   ^            ^       ^
>>     tx1   tx2  tx3 tx4          tx5     tx6
>>
>>
>>   timeline default timeouts@ledger1:
>>     |.........|.........|.........|.........|....
>>             ^     ^    ^   ^            ^       ^
>>            tx1   tx2  tx3 tx4          tx5     tx6
>>
>>
>>   timeline txs received@ledger2:
>>     |.........|.........|.........|.........|....
>>       ^     ^     ^   ^            ^       ^
>>      tx1   tx2   tx3 tx4          tx5     tx6
>>
>>
>>   timeline unavailability@ledger2:
>>     |.........|.........|.........|.........|....
>>
>>                        ^^^^^^^^^^^
>>                        unavailable
>>
>>
>>   timeline txs fulfilled@ledger2:
>>     |.........|.........|.........|.........|....
>>       ^     ^    ^                   ^   ^   ^
>>      tx1   tx2  tx3                 tx5 tx4 tx6
>>
>>
>>   timeline fulfillment received@ledger1:
>>     |.........|.........|.........|.........|....
>>        ^     ^     ^                   ^   ^  ^
>>       tx1   tx2   tx3                 tx5 tx4tx6
>>                                            ^
>>                                           race
>>                                         condition
>>
>>   In the previous diagram, "unavailable" time represents any internal state
>> in ledger2 that doesn't allow it to process/fulfill the received ILP
>> transfer (a reset, overload, ...).  In the blockchain case it could also be
>> the case that miners' fees are temporarily higher than normal, which will
>> make the system "unavailable" to TXs sent with lower fees.
>>
>>   The problem with race-conditions arises with transfers arriving just
>> before the unavailable time window in ledger2 (tx4 in this case).
>>
>>   tx4 will be fulfilled by ledger2, but when the fulfillment arrives at the
>> originating ledger, the transfer has already expired (race condition).
>>
>>   Now imagine that the timeouts in ledger1 are established as explained
>> in the previous mail.
>>
>>   Let's take K = 2, which means that the timeout for tx4 doesn't start to
>> count until tx6 or tx7 or ... has arrived. That is, until tx_4+K has arrived.
>>
>>   In this case, tx6 is processed quickly, as expected when the system is
>> running and not overloaded. The tx6 fulfillment arrives at ledger1, and it
>> is at that moment that the timeout for tx4 starts to run. Since the
>> fulfillment for tx4 has already arrived shortly after tx5, the race
>> condition disappears.
>>
>>   Obviously there are other problems that can arise, like what happens if
>> ledger2 is unresponsive "for a long time" and the queue of prepared
>> transfers from ledger1 to ledger2 starts to accumulate. In this case we can
>> define another timeout, related to how long we consider any ledger can be
>> unresponsive. That is, a timeout for ledger responses (versus a timeout for
>> transfer fulfillments).
>>
>>   When such a timeout passes, we can roll back transfers in ledger1 and
>> ledger2 (and ledger2 will ignore valid fulfillments since it's aware that
>> ledger1 will not accept them anyway).
>>
>>  It also doesn't solve the case in which ledger2 fulfills the transfer
>> and the fulfillment is lost on the way back to ledger1 due to some
>> misbehaving connector, but, I think, this could be considered a "weird"
>> scenario.
>>
>>
>> Regards,
>>
>> *Enrique*
>>
>>
>> ------------------------------
>> *From:* David Fuelling [dfuelling@sappenin.com]
>> *Sent:* Wednesday, 19 July 2017 3:42
>> *To:* Enrique Arizon Benito; public-interledger@w3.org
>> *Subject:* Re: Alternative algorithm for timeouts
>>
>> Hey Enrique, can you clarify the meaning of M, N, and K?
>>
>> Thanks!
>> David
>> On Fri, Jun 30, 2017 at 6:08 AM Enrique Arizon Benito <
>> enrique.arizon.benito@everis.com> wrote:
>>
>>> At this moment the algorithm to establish a timeout is something like:
>>>
>>>    1. start transaction
>>>    2. set timeout as "NOW" + Constant_Timeout_time
>>>    3. wait
>>>    4. "NOW" > timeout?
>>>       1. YES -> time out the transaction
>>>       2. NO -> go to 3
>>> This introduces random false timeouts due to race-conditions. It could
>>> be possible for the receiving ledger to execute the transfer and for the
>>> originating ledger to time out due to the finite time to propagate the
>>> fulfillment back through all connectors.
>>>
>>>
>>> I think the following alternative algorithm minimizes (nearly avoids) all
>>> false timeouts due to race conditions:
>>>
>>> - The sending ledger keeps a list of txs not yet executed (that can
>>> time out)  *for each* destination ledger.
>>>
>>>    Those txs are stored in a list ordered according to the sending time
>>> [tx1, tx2, tx3, tx4, ...]
>>>
>>>
>>> - When tx"M" with M = N + K arrives, a timeout is established for tx"N"
>>> for each "N" < M - K
>>>
>>>
>>> - Finally after a given timeout,  tx"N" is cancelled
>>>
>>>
>>> Said another way:
>>>
>>> -  A transaction tx"N" in the list does NOT time out unless both of the
>>> following conditions hold:
>>>
>>>   - tx"M" fulfillment with M > N + K has arrived
>>>
>>>   - tx"N" has timed out.
>>>
>>>
>>>
>>> The logic is as follows:
>>>
>>>  - It's possible that at some point there is an overload in the
>>> destination ledger. At this point the average time to process the incoming
>>> transfer, execute it and return the fulfillment will increase, approaching
>>> the initial timeout. The closer it is to the timeout, the more probable a
>>> race condition becomes during the "travel back".
>>>
>>>
>>> - Due to the destination ledger system overload, it is quite possible that
>>> all TXs are delayed, and it makes sense for the originating ledger to wait
>>> longer than usual.
>>> - Suppose now that tx"M" arrives. At this moment the sending ledger sets up
>>> a timeout for each tx"N" with N < M - K.
>>>
>>>
>>>   It's quite sensible to think that if there was a system overload in
>>> the destination ledger, once it's back to normal all pending transfers will
>>> arrive shortly after.
>>>
>>>   It's also quite sensible to think that if tx"M" arrives, and the timeout
>>> passes, most probably tx"N" has really timed out (it never reached the
>>> destination ledger) and that no race condition will arise since it never
>>> reached the destination ledger.
>>>
>>>
>>>
>>> Regards,
>>>
>>> Enrique
>>>
>>
Received on Thursday, 20 July 2017 08:11:40 UTC