Re: Alternative algorithm for timeouts from Enrique Arizon Benito on 2017-07-20 (public-interledger@w3.org from July 2017)

From: Enrique Arizon Benito <enrique.arizon.benito@everis.com>
Date: Thu, 20 Jul 2017 07:12:35 +0000
To: Interledger Community Group <public-interledger@w3.org>
Message-ID: <C23179174213FF4A8B0ADF823E0D2DD0C93EA1CB@MBXEUR04.usersad.everis.int>

In particular I was thinking about a network outage/slow-down in the specific case when such outage/slow-down is caused by the receiving ledger that fulfills the transfer (versus intermediate ledgers that just propagate the transfer back-and-forth) during the time when the transfer arrives to the time when the fulfillment triggers the execution of payment.

This is the most risky scenario, because the transfer could be fulfilled from the point of view of the receiving ledger, and unfulfilled from the originating one.

I'm thinking the previous timeout algorith can even be more simple:

Let's suppose next conditions:
- "delivery in time" is defined to be a maximun of 10 seconds
- In normal circumstances it takes 0.1secs for the receiving ledger to fulfill an incomming transfer.
- It takes a 1 second for the transfer to "arrive" at ledger2.
- It takes another 1 second for the fulfillment to propagate back.

That means that in normal circumstances it takes a total of 1 + 0.1 + 1 = 2.1 seconds for the fulfillment to arrive, so the "delivery in time" is guaranteed.

It could be the case that in abnormal circumstances ledger2 is overloaded (or in the blockchain case, that the fees are greated than expected) and it takes 9 seconds to fulfill the transfer instead of 0.1 secs as ussual. In such case the total time will be 1 + 9 + 1 = 11 secs. Since this scenario will be unusual, it could be perfectly possible for ledger1 to wait slightly more to avoid the race-condition (fulfilled in ledger2, un-fulfilled in ledger1).

Now the key point: If ledger2 takes more than the expected "delivery in time" ***from the time a transfer arrives to the receiving ledger to the time the fulfillment is created***, then it ignores the fulfillment with a timeout and rollbacks. In such case an ILPError will be transmited back to the originating ledger1 and eventually it will also rollback. There is no discrepancy by ledgers and connectors.

This mode of operation (mostly) eliminates the race-condition but raises a usability problem when the transfer fails from the user point of ledger1 doing the payment. It will not loose any money, neither will a connector in the midst, but it will take more than "delivery in time" to be notified of the failure. I think this is mostly acceptable. Maybe it could be as simple as providing a feedback with a popup "The payment is taking an abnormal amount of time to be confirmed, please wait".

A bigger problem is that ledger1 does not know what "waiting slightly more" really means but different solutions exists. Some simple alternative algorithms:
- An ILPError arrives from ledger2 indicating the timeout.
This is the simplest case, and will never cause a race-condition, but it could be the case that ledger2 is completly blocked and can not even send an ILPerror back.

Then some alternative predictive (predictive == risk of "guessing the furute") algorithms can be used:

- If transferN has not arrived after creation + fixed Timeout cancel. (This is the current algorithm).

- If transferN has not arrived in time, wait to see if transferN+1, transferN+2 arrive and then finally cancel transferN. It fails if the traffic is very low (there is no transferN+1), but actually if transferN+1 fulfillment arrives and after a timeout transferN fulfillment has not, ledger1 can be pretty sure that the transfer failed.

- If last 100 transfers to ledger2 arrived in time, let's add a some extra time-window since ledger1 will have good reasons to trust this ledger.

- ....

Once we assume that sending ledger will use predictive algorithms to determine the timeout, the possibility for random race-conditions appears, so those predictive algorithms must be used just for the worst-case scenario (receiving ledger non-recovarable failure just after fulfillment and local execution and no error arriving to the originating ledger).

Summarizing, IMHO, the ledger1 timeout calculus used right now is quite arbitrary and augment the risk for intermediate connectors in case of receiving ledger overload with minor benefits for final users.

Regards,

Enrique

________________________________
De: Michiel de Jong [michiel@unhosted.org]
Enviado: miércoles, 19 de julio de 2017 14:08
Para: Enrique Arizon Benito
CC: David Fuelling; public-interledger@w3.org
Asunto: Re: Alternative algorithm for timeouts

Hi Enrique,

I think I understand now what you're getting at; basically, if there is a network outage, wait a bit longer before you time out the transaction. Right? Your point reminds me of https://github.com/interledger/rfcs/issues/159, and I guess it's one of the more confusing things when you first start to think about how the Interledger protocol works. The hard timeouts used by Interledger require connectors to take a fulfillment risk: they have to set a minimal message window, and if they set this value too low, it might be that in practice they are too slow (because of server load, or network downtime), and they lose money.

There is a reason though for the hard timeouts: it provides a predictable end-to-end experience: delivery on time, or your money back. The fact that connectors give you a sort of SLA and take care of the timeout risk, means that you as a sender don't have to, and you can count on a hard guarantee.

You can't tell the next connector "it's ok if you're a bit late", because that only works if your own previous connector is also so kind to you, and so that's just moving the risk from one connector to the previous one.

You could also set a very high timeout, which only ever gets reached if the comms links are really down. But that will only benefit the first few connectors on the path, only up to the point that their previous connector is willing to keep money on hold for them.

You could imagine a protocol (different from ILP) where the "delivery on time, or your money back" is not a design goal. If the comms network is usually quite reliable, payment senders/receivers are usually not in a hurry (for instance, the network is designed for settling debt loops, which is not a while-you-wait use case) and pretending to be down is not a way in which neighboring connectors try to steal from each other (for instance, because the inflight amount they could steal is much lower than their ongoing business opportunity if they don't steal it), you could allow link downtime in the comms network to cause longer delays at the end-to-end level. You could then even, for instance, not specify a hard wallclock timeout time at all up front, but just send a "this is taking too long, please roll back" message to trigger the rollback of a payment that's taking too long.

But in the case of ILP, I think there's no way around it, the sender gets a guarantee, and all connectors along the path have to take a fulfillment risk, in order to be able to provide this guarantee. If the sender sets a short timeout, then the connector has to set a timeout for the next hop that's even shorter.

Cheers,
Michiel.

On Wed, Jul 19, 2017 at 10:02 AM, Enrique Arizon Benito <enrique.arizon.benito@everis.com<mailto:enrique.arizon.benito@everis.com>> wrote:
I'll try to make it more visual.

Let's suppose senders initiating the transfers are on ledger1 and receivers are on ledger2.

Suppose "randomly" ledger1 initiates next transactions, ordered according to its *local* time.

In this diagram the dot '.' represents a unit of time of 1/15/... secs:

timeline tx created@ledger1:
|.........|.........|.........|.........|....
^ ^ ^ ^ ^ ^
tx1 tx2 tx3 tx4 tx5 tx6

timeline default timeouts@ledger1:
|.........|.........|.........|.........|....
^ ^ ^ ^ ^ ^
tx1 tx2 tx3 tx4 tx5 tx6

timeline txs received@ledger2:
|.........|.........|.........|.........|....
^ ^ ^ ^ ^ ^
tx1 tx2 tx3 tx4 tx5 tx6

timeline unavailability@ledger2:
|.........|.........|.........|.........|....
^^^^^^^^^^^
unavailable

timeline txs fulfilled@ledger2:
|.........|.........|.........|.........|....
^ ^ ^ ^ ^ ^
tx1 tx2 tx3 tx5 tx4 tx6

timeline fulfillment received@ledger1:
|.........|.........|.........|.........|....
^ ^ ^ ^ ^ ^
tx1 tx2 tx3 tx5 tx4tx6
^
race
condition

In the previous diagram "unavailable" time represent any internal state in
ledger2 that doesn't allow to process/fulfill the received ILP transfer (a
reset, overload, ...). In the blockchain case it could also the case that
temporally miners fees are higher than normal, that will make the system
"unavailable" to TXs sent with lower fees.

The problem with race-conditions arise with transfers arriving just before
the unavailable window time in ledger2 (tx4 in this case).

tx4 will be fulfilled by ledger2 but when the fulfillment arrives to the
originating ledger, the transfer is already expired (race condition).

Now imagine that the timeouts in ledger1 are established as explained in the
previous mail.

Let's take K = 2, that means that timeout for tx4 doesn't start to count
until tx6 or tx7 or ... has arrived. That is, until tx_4+K has arrived.

In this case, tx6 is processed quickly as expected when the system is
running and not overloaded. tx6 fulfillment arrives to ledger1, and is at that
moment that the timeout for tx4 starts to run. Since the fulfillment for tx4
has already arrived shortly after tx5, the race condition dissapears.

Obviously there are other problems that can arise, like what happen if
ledger2 is unresponsive "for long time" and the queue of prepared transfers in
ledger1 to ledger2 starts to acumulate. In this case we can define another
timeout, related to how much time we consider any ledger can be un-responsive.
That is a timeout for ledger responses (versus a timeout for transfer fulfillments).

When such timeout passes, we can rollback transfers in ledger1 and ledger2 (and
ledger2 will ignore valid fulfillments since it's aware that ledger1 will not
accept them anyway).

It also doesn't solve the case in which ledger2 fulfills the transfer and the
fulfillment is lost in the way back to ledger1 due to some misbehaving
connector, but, I think, this could be considered a "weird" scenario.

Regards,

Enrique

________________________________
De: David Fuelling [dfuelling@sappenin.com<mailto:dfuelling@sappenin.com>]
Enviado: miércoles, 19 de julio de 2017 3:42
Para: Enrique Arizon Benito; public-interledger@w3.org<mailto:public-interledger@w3.org>
Asunto: Re: Alternative algorithm for timeouts

Hey Enrique, can you clarify the meaning of M, N, and K?

Thanks!
David
On Fri, Jun 30, 2017 at 6:08 AM Enrique Arizon Benito <enrique.arizon.benito@everis.com<mailto:enrique.arizon.benito@everis.com>> wrote:
At this moment the algorithm to establish a timeout is something like:

1. start transaction
2. set timeout as "NOW" + Constant_Timeout_time
3. wait
4. "NOW" > timeout
* YES -> timeout transaction
* NOW -> Got to 3

This introduces random false timeouts due to race-conditions. It could be possible for the receiving ledger to execute the transfer and for the originating ledger to time out due to the finite time to propagate the fulfillment back through all connectors.

I think next alternative algorithm minimized (nearly avoids) all false timeouts due to race conditions?

- The sending ledger keeps a list of txs not yet executed (that can timeout) *for each* destination ledger.

Those txs are stored in an ordered list acording to the sending time [tx1, tx2, tx3, tx4, ...]

- When tx"M" with M = N + K arrives, a timeout is established for tx"N" for each "N" < M - K

- Finally after a given timeout, tx"N" is cancelled

Said it otherway:

- transaction in the list does NOT time-out unless two next condition happens:

- tx"M" fulfillment with M > N + K has arrived

- tx"N" has timed out.

The logic is next:

- It's possible that at some point there is an overload in the destination ledger. At this point the average time to process the incomming transfer, executing and returning the fulfillment will increase approaching the initial timeout. The closer it is to the timeout the more probable for race condition during the "travel back".

- Due to the destination ledger system overload, is quite possible that all TXs are delayed and it makes sense for the originating ledger to wait more than ussual.
- Suppose now that tx"M" arrives. At this moment sending ledger setup a timeout for each tx"N" with N < M - K.

It's quite sensible to think that if there was a system overload in the destination ledger once it's back to normal all pending transfer will arrive shortly after.

It's also quite sensible to think that if tx"M" arrives, and timeout passes, most probably tx"N" has really timeout (it never reached the destinantion ledger) and that no race condition will arise since they never reached the destination ledger.

Regards,

Enrique

________________________________

AVISO DE CONFIDENCIALIDAD.
Este correo y la información contenida o adjunta al mismo es privada y confidencial y va dirigida exclusivamente a su destinatario. everis informa a quien pueda haber recibido este correo por error que contiene información confidencial cuyo uso, copia, reproducción o distribución está expresamente prohibida. Si no es Vd. el destinatario del mismo y recibe este correo por error, le rogamos lo ponga en conocimiento del emisor y proceda a su eliminación sin copiarlo, imprimirlo o utilizarlo de ningún modo.

CONFIDENTIALITY WARNING.
This message and the information contained in or attached to it are private and confidential and intended exclusively for the addressee. everis informs to whom it may receive it in error that it contains privileged information and its use, copy, reproduction or distribution is prohibited. If you are not an intended recipient of this E-mail, please notify the sender, delete it and do not read, act upon, print, disclose, copy, retain or redistribute any portion of this E-mail.

Received on Thursday, 20 July 2017 07:13:03 UTC