Re: Alternative algorithm for timeouts from Michiel de Jong on 2017-07-19 (public-interledger@w3.org from July 2017)

From: Michiel de Jong <michiel@unhosted.org>
Date: Wed, 19 Jul 2017 14:08:28 +0200
To: Enrique Arizon Benito <enrique.arizon.benito@everis.com>
Cc: David Fuelling <dfuelling@sappenin.com>, "public-interledger@w3.org" <public-interledger@w3.org>
Message-ID: <CA+aD3u1PQMT-HQF=FaGKQ9dfGKrGGkaX=Xy4fQk1DR8Rk4iSpQ@mail.gmail.com>
Hi Enrique,

I think I understand now what you're getting at; basically, if there is a
network outage, wait a bit longer before you time out the transaction.
Right? Your point reminds me of
https://github.com/interledger/rfcs/issues/159, and I guess it's one of the
more confusing things when you first start to think about how the
Interledger protocol works. The hard timeouts used by Interledger require
connectors to take a fulfillment risk: they have to set a minimum message
window, and if they set this value too low, they may in practice turn out
to be too slow (because of server load or network downtime) and lose
money.

There is a reason for the hard timeouts, though: they provide a predictable
end-to-end experience: delivery on time, or your money back. Because
connectors give you a sort of SLA and take on the timeout risk, you as a
sender don't have to, and you can count on a hard guarantee.

You can't tell the next connector "it's ok if you're a bit late", because
that only works if your own previous connector is equally lenient with you,
so that just moves the risk from one connector to the previous one.

You could also set a very high timeout, which only ever gets reached if the
comms links are really down. But that will only benefit the first few
connectors on the path, only up to the point that their previous connector
is willing to keep money on hold for them.

You could imagine a protocol (different from ILP) where "delivery on
time, or your money back" is not a design goal. Suppose the comms network is
usually quite reliable, payment senders/receivers are usually not in a
hurry (for instance, the network is designed for settling debt loops, which
is not a while-you-wait use case), and pretending to be down is not a way in
which neighboring connectors try to steal from each other (for instance,
because the in-flight amount they could steal is much lower than the
ongoing business opportunity they would lose by stealing it). Then you
could allow link downtime in the comms network to cause longer delays at
the end-to-end level. You could even, for instance, not specify a hard
wall-clock timeout up front at all, but just send a "this is taking too
long, please roll back" message to trigger the rollback of a payment that's
taking too long.

But in the case of ILP, I think there's no way around it, the sender gets a
guarantee, and all connectors along the path have to take a fulfillment
risk, in order to be able to provide this guarantee. If the sender sets a
short timeout, then the connector has to set a timeout for the next hop
that's even shorter.
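
To illustrate the last point, here is a minimal Python sketch of how expiries
shrink hop by hop. It is an illustration only: the function name
`hop_expiries`, the per-connector window values and the example timestamps
are assumptions of mine, not part of ILP or of any connector implementation.

```python
from datetime import datetime, timedelta, timezone


def hop_expiries(sender_expiry, message_windows):
    """Return the expiry each connector sets on its outgoing transfer.

    message_windows[i] is the (hypothetical) minimum message window that
    connector i reserves for relaying the fulfillment back before its own
    incoming transfer expires.
    """
    expiries = []
    expiry = sender_expiry
    for window in message_windows:
        # Each connector's outgoing transfer must expire earlier than its
        # incoming one, by at least its message window.
        expiry = expiry - window
        expiries.append(expiry)
    return expiries


# Illustrative values: the sender's transfer expires at 12:00:10 UTC and
# three connectors reserve 2 s, 1 s and 3 s respectively.
sender_expiry = datetime(2017, 7, 19, 12, 0, 10, tzinfo=timezone.utc)
windows = [timedelta(seconds=2), timedelta(seconds=1), timedelta(seconds=3)]

for i, expiry in enumerate(hop_expiries(sender_expiry, windows)):
    print(f"connector {i} outgoing expiry: {expiry.isoformat()}")
```

The windows add up along the path, so a short sender timeout leaves very
little time for the last hops. A connector that picks too small a window
carries the risk itself.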


Cheers,
Michiel.

On Wed, Jul 19, 2017 at 10:02 AM, Enrique Arizon Benito <
enrique.arizon.benito@everis.com> wrote:

>  I'll try to make it more visual.
>
>  Let's suppose the senders initiating the transfers are on ledger1 and the
> receivers are on ledger2.
>
>  Suppose ledger1 initiates the following transactions at "random" times,
> ordered according to its *local* time.
>
>  In this diagram each dot '.' represents one unit of time (1/15/... secs):
>
>   timeline tx created@ledger1:
>     |.........|.........|.........|.........|....
>      ^     ^    ^   ^            ^       ^
>     tx1   tx2  tx3 tx4          tx5     tx6
>
>
>   timeline default timeouts@ledger1:
>     |.........|.........|.........|.........|....
>             ^     ^    ^   ^            ^       ^
>            tx1   tx2  tx3 tx4          tx5     tx6
>
>
>   timeline txs received@ledger2:
>     |.........|.........|.........|.........|....
>       ^     ^     ^   ^            ^       ^
>      tx1   tx2   tx3 tx4          tx5     tx6
>
>
>   timeline unavailability@ledger2:
>     |.........|.........|.........|.........|....
>
>                        ^^^^^^^^^^^
>                        unavailable
>
>
>   timeline txs fulfilled@ledger2:
>     |.........|.........|.........|.........|....
>       ^     ^    ^                   ^   ^   ^
>      tx1   tx2  tx3                 tx5 tx4 tx6
>
>
>   timeline fulfillment received@ledger1:
>     |.........|.........|.........|.........|....
>        ^     ^     ^                   ^   ^  ^
>       tx1   tx2   tx3                 tx5 tx4tx6
>                                            ^
>                                           race
>                                         condition
>
>   In the previous diagram, "unavailable" time represents any internal state
> in ledger2 that prevents it from processing/fulfilling a received ILP
> transfer (a reset, overload, ...). In the blockchain case it could also be
> that miners' fees are temporarily higher than normal, which makes the system
> "unavailable" to TXs sent with lower fees.
>
>   The race-condition problem arises with transfers that arrive just before
> the unavailability window in ledger2 (tx4 in this case).
>
>   tx4 will be fulfilled by ledger2, but by the time the fulfillment reaches
> the originating ledger, the transfer has already expired (race condition).
>
>   Now imagine that the timeouts in ledger1 are established as explained in
> the previous mail.
>
>   Let's take K = 2. That means the timeout for tx4 doesn't start to count
> until tx6 or tx7 or ... has arrived. That is, until tx_(4+K) has arrived.
>
>   In this case, tx6 is processed quickly, as expected when the system is
> running and not overloaded. The tx6 fulfillment arrives at ledger1, and it
> is at that moment that the timeout for tx4 starts to run. Since the
> fulfillment for tx4 has already arrived shortly after tx5, the race
> condition disappears.
>
>   Obviously other problems can arise, like what happens if ledger2 is
> unresponsive "for a long time" and the queue of prepared transfers from
> ledger1 to ledger2 starts to accumulate. In this case we can define another
> timeout, related to how long we consider any ledger can be unresponsive.
> That is, a timeout for ledger responses (versus a timeout for transfer
> fulfillments).
>
>   When that timeout passes, we can roll back transfers in ledger1 and
> ledger2 (and ledger2 will ignore valid fulfillments since it's aware that
> ledger1 will not accept them anyway).
>
>  It also doesn't solve the case in which ledger2 fulfills the transfer and
> the fulfillment is lost on the way back to ledger1 due to some misbehaving
> connector, but I think this could be considered a "weird" scenario.
>
>
> Regards,
>
> *Enrique*
>
>
> ------------------------------
> *From:* David Fuelling [dfuelling@sappenin.com]
> *Sent:* Wednesday, July 19, 2017 3:42
> *To:* Enrique Arizon Benito; public-interledger@w3.org
> *Subject:* Re: Alternative algorithm for timeouts
>
> Hey Enrique, can you clarify the meaning of M, N, and K?
>
> Thanks!
> David
> On Fri, Jun 30, 2017 at 6:08 AM Enrique Arizon Benito <
> enrique.arizon.benito@everis.com> wrote:
>
>> At the moment, the algorithm to establish a timeout is something like:
>>
>>    1. start transaction
>>    2. set timeout as "NOW" + Constant_Timeout_time
>>    3. wait
>>    4. "NOW" > timeout?
>>       1. YES -> time out the transaction
>>       2. NO -> go to 3
>> This introduces random false timeouts due to race conditions. The
>> receiving ledger may execute the transfer while the originating ledger
>> times out, because propagating the fulfillment back through all connectors
>> takes a finite amount of time.
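
A minimal Python sketch of the race described above, under simplifying
assumptions: a single clock, times in seconds, and one fixed timeout on the
originating ledger (the per-hop expiries of a real path are ignored). The
names `classify_transfer` and `return_delay` are hypothetical and chosen for
illustration only.

```python
def classify_transfer(prepared_at, timeout, executed_at, return_delay):
    """Classify a transfer under the fixed "NOW + constant" timeout rule.

    prepared_at:  when the originating ledger prepared the transfer
    timeout:      the constant timeout added to prepared_at
    executed_at:  when the receiving ledger executed the transfer
    return_delay: time the fulfillment needs to travel back
    """
    expires_at = prepared_at + timeout
    fulfillment_arrives_at = executed_at + return_delay

    if fulfillment_arrives_at <= expires_at:
        return "fulfilled"
    if executed_at <= expires_at:
        # The receiver executed in time, but the fulfillment came back
        # after the originating ledger had already expired the transfer.
        return "false timeout"
    return "timeout"


print(classify_transfer(prepared_at=0, timeout=10, executed_at=4, return_delay=2))
print(classify_transfer(prepared_at=0, timeout=10, executed_at=9, return_delay=2))
```

The second call is the problematic case: execution happens before the
expiry, but the return trip pushes the fulfillment past it.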
>>
>>
>> I think the following alternative algorithm minimizes (nearly avoids) all
>> false timeouts due to race conditions:
>>
>> - The sending ledger keeps a list of txs not yet executed (that can
>> time out) *for each* destination ledger.
>>
>>    Those txs are stored in a list ordered by sending time:
>> [tx1, tx2, tx3, tx4, ...]
>>
>>
>> - When tx"M" with M = N + K arrives, a timeout is established for tx"N"
>> for each "N" < M - K
>>
>>
>> - Finally after a given timeout,  tx"N" is cancelled
>>
>>
>> In other words:
>>
>> - a transaction in the list does NOT time out unless both of the following
>> conditions hold:
>>
>>   - tx"M" fulfillment with M > N + K has arrived
>>
>>   - tx"N" has timed out.
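
The rule above could be sketched in Python roughly as follows. This is a
sketch under assumptions, not an implementation from this thread: the class
and method names are hypothetical, times are plain numbers, and the clock for
tx N is started when the fulfillment of tx M with N <= M - K arrives (the
reading that matches the K = 2 example, where tx6 starts the clock for tx4;
the text elsewhere writes the comparison as strict).

```python
class DeferredTimeouts:
    """Pending transfers to one destination ledger, with deferred timeouts."""

    def __init__(self, k, timeout):
        self.k = k
        self.timeout = timeout
        self.next_seq = 1
        # seq -> deadline, or None while the timeout clock has not started
        self.deadlines = {}

    def send(self):
        """Register a newly prepared transfer and return its sequence number."""
        seq = self.next_seq
        self.next_seq += 1
        self.deadlines[seq] = None
        return seq

    def on_fulfillment(self, m, now):
        """Handle the fulfillment of tx m arriving at time `now`."""
        self.deadlines.pop(m, None)
        for n in list(self.deadlines):
            # Start the clock for older pending transfers that are at
            # least k positions behind the one just fulfilled.
            if self.deadlines[n] is None and n <= m - self.k:
                self.deadlines[n] = now + self.timeout

    def expire(self, now):
        """Remove and return transfers whose started clock has run out."""
        expired = sorted(
            n for n, d in self.deadlines.items() if d is not None and d <= now
        )
        for n in expired:
            del self.deadlines[n]
        return expired


# Fulfillment order loosely following the diagram: tx5 overtakes tx4.
tracker = DeferredTimeouts(k=2, timeout=3)
for _ in range(6):
    tracker.send()
for seq, t in [(1, 2), (2, 8), (3, 14), (5, 34), (4, 38), (6, 41)]:
    tracker.on_fulfillment(seq, now=t)
print(tracker.expire(now=100))  # nothing left to expire in this run
```

One consequence of this sketch is that the newest K transfers never get a
clock unless later traffic is fulfilled, which is why a separate timeout for
ledger responses would still be needed.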
>>
>>
>>
>> The reasoning is as follows:
>>
>>  - It's possible that at some point the destination ledger is overloaded.
>> At that point the average time to process an incoming transfer, execute it
>> and return the fulfillment will increase, approaching the initial timeout.
>> The closer it gets to the timeout, the more likely a race condition during
>> the "travel back".
>>
>>
>> - Due to the overload of the destination ledger, it is quite possible that
>> all TXs are delayed, and it makes sense for the originating ledger to wait
>> longer than usual.
>> - Suppose now that tx"M" arrives. At this moment the sending ledger sets up
>> a timeout for each tx"N" with N < M - K.
>>
>>
>>   It's quite sensible to think that if there was a system overload in the
>> destination ledger, once it's back to normal all pending transfers will
>> arrive shortly after.
>>
>>   It's also quite sensible to think that if tx"M" arrives and the timeout
>> passes, most probably tx"N" has really timed out (it never reached the
>> destination ledger) and that no race condition will arise, since it never
>> reached the destination ledger.
>>
>>
>>
>> Regards,
>>
>> Enrique
>>
>>
>> ------------------------------
>>
>> CONFIDENTIALITY WARNING.
>> This message and the information contained in or attached to it are
>> private and confidential and intended exclusively for the addressee. everis
>> informs to whom it may receive it in error that it contains privileged
>> information and its use, copy, reproduction or distribution is prohibited.
>> If you are not an intended recipient of this E-mail, please notify the
>> sender, delete it and do not read, act upon, print, disclose, copy, retain
>> or redistribute any portion of this E-mail.
>>
>
Received on Wednesday, 19 July 2017 12:08:58 UTC