Re: Draft for Resumable Uploads from Roy T. Fielding on 2022-04-01 (ietf-http-wg@w3.org from April to June 2022)

From: Roy T. Fielding <fielding@gbiv.com>
Date: Fri, 1 Apr 2022 10:30:27 -0700
To: Marius Kleidl <marius@transloadit.com>
Cc: ietf-http-wg@w3.org
Message-Id: <82FAD6B4-F72F-42E0-A72D-4BFAAB9668FD@gbiv.com>
> On Apr 1, 2022, at 2:48 AM, Marius Kleidl <marius@transloadit.com> wrote:
> 
> Hello HTTP working group,
> 
> we are all familiar with connectivity disruptions affecting our internet activities. One example is when a large file download is interrupted; say a 100 MB file download encounters a network loss after the client receives 70 MB. Fortunately, resumable HTTP downloads using range requests are a widely deployed standard feature that allows clients to fetch the remaining 30 MB only, saving time and resources for both endpoints. However, in the opposite direction, there is not a standard convention for resuming HTTP uploads.
> 
> Across the HTTP ecosystem there are several different approaches to providing resumable uploads. We are aware of at least one attempt to try and standardize an approach [1], but to our knowledge none have succeeded in being adopted and driven to conclusion.
> 
> We believe resumable uploads are a common problem and that there is value in a standard resumable upload approach. We've been working on a document [2] [3] that uses HTTP to solve what we believe to be the core problem set, while also allowing for extended use cases. We are bringing this to the list to understand if there is interest in the working group to solve the problem, and whether our document is a good basis for a solution.
> 
> In case you are interested in the background of this draft: The origin is within the tus project [4], which has been developing an HTTP-based protocol for resumable uploads [5] since 2013 (tus was also posted on this mailing list at the time [6]). Furthermore, we also provide various open-source implementations [7] to allow easy usage on the web, in mobile applications, desktop applications, or server environments. tus has seen great adoption, proving that there is a demand for an open-source solution providing resumable uploads.
> 
> We hope to bring resumable uploads to more people. For this, adopting resumable uploads into HTTP would be a great step. There is also interest in including support for resumable uploads natively into platforms, like browsers and mobile SDKs, so that developers do not have to bring their own library for resumable uploads.
> 
> We have taken the main uploading process from our tus protocol and reworked it into a self-contained draft, which we want to present to you! As such, this draft can be seen as an evolution of our work on tus and as a step to increase availability of resumable uploads.
> 
> Thank you for any feedback in advance!
> 
> Best regards,
> Marius Kleidl
> 
> [1] https://lists.w3.org/Archives/Public/ietf-http-wg/2019JulSep/0066.html
> [2] https://datatracker.ietf.org/doc/draft-tus-httpbis-resumable-uploads-protocol/
> [3] https://github.com/tus/tus-v2
> [4] https://tus.io/
> [5] https://tus.io/protocols/resumable-upload.html
> [6] https://mailarchive.ietf.org/arch/msg/httpbisa/I__B5Kc7h-1TvRRB9rmjY8tR-T0/
> [7] https://tus.io/implementations.html
> 
This is probably not a good day to discuss this, but it is clear from the
draft that this is not using HTTP correctly.

tus-v2 assumes that there is a separate resource for uploading, as opposed to
targeting a resource and letting the server decide whether it can upload into
a temporary resource for that target. It doesn't indicate what the server
is to do with the data once it is uploaded, which implies this is just part
of a private agreement instead of a standard protocol.

Subsequent requests target the same upload resource, instead of targeting
a separate temporary resource in progress. This results in some seriously
confused semantics when the client ends with a DELETE targeting the resource
for uploading.

Changing the semantics of an existing method using a header field is only
interoperable if the new field can be ignored. That is not the case here
for a DELETE on the process URI.

Likewise, not targeting by resource (URI) interferes with resource-based
access control and authorization, and fails to distinguish between uploads
where the user agent knows where to PUT the data and those where the
user agent is asking the server to choose where to POST the data.

For example, what happens when the server includes multiple
user-authenticated subtrees and this user is only authorized to upload
to some of them?
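
As a rough illustration of that question (a sketch only; the ACL table,
user names, and paths below are made up and appear in neither the draft nor
this message), a server that authorizes by target URI can decide at request
time whether the upload is allowed:

```python
# Hypothetical ACL: each user maps to the URI path prefixes (subtrees)
# they are allowed to upload into. All names and paths are illustrative.
UPLOAD_ACL = {
    "alice": ["/users/alice/", "/shared/"],
    "bob": ["/users/bob/"],
}


def may_upload(user: str, target_path: str) -> bool:
    """Return True if target_path lies inside a subtree the user may write to."""
    prefixes = UPLOAD_ACL.get(user, [])
    return any(target_path.startswith(prefix) for prefix in prefixes)
```

A check like this needs the target URI in the request. If every upload goes
to one generic upload resource, the server has no per-resource target to
evaluate when the upload begins.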

A simple fix is to send the initial upload as a PUT (to a target URI for
the completed upload) or as a POST (to clearly allow the server to select
a destination). The server can indicate that it supports continuation by
providing a temporary URI in a 1xx response. This new target is essentially
a buffer with a URI. The client can then monitor/continue requests on the
new URI, cancel by sending DELETE to that new URI, or finalize the upload
by sending some final metadata (e.g., Digest) to that new URI. Once final
(either by completing the original request or receiving a finalize request on
the temporary URI), the server can move the received data to where the client
indicated and delete the temporary URI.
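
The flow above can be sketched as a small in-memory model. This is only an
illustration of the state transitions, not a wire protocol or a real server:
the class, the method names, the /uploads/tmp/ path, and the use of a SHA-256
hex digest as the final metadata are all assumptions made for the example.
It models the PUT case, where the client names the target URI.

```python
import hashlib
import itertools


class UploadServer:
    """In-memory model of the temporary-URI flow (hypothetical names)."""

    def __init__(self):
        self.resources = {}  # target URI -> completed representation
        self.buffers = {}    # temporary URI -> (target URI, received bytes)
        self._ids = itertools.count(1)

    def begin(self, target_uri: str, first_chunk: bytes) -> str:
        """Initial PUT: buffer what arrived and return the temporary URI
        (which a real server would announce in a 1xx response)."""
        temp_uri = f"/uploads/tmp/{next(self._ids)}"
        self.buffers[temp_uri] = (target_uri, bytearray(first_chunk))
        return temp_uri

    def offset(self, temp_uri: str) -> int:
        """Monitor: how many bytes the buffer holds so far."""
        return len(self.buffers[temp_uri][1])

    def append(self, temp_uri: str, offset: int, chunk: bytes) -> int:
        """Continue: append a chunk at the expected offset."""
        _, buf = self.buffers[temp_uri]
        if offset != len(buf):
            raise ValueError("offset does not match buffered length")
        buf.extend(chunk)
        return len(buf)

    def cancel(self, temp_uri: str) -> None:
        """DELETE on the temporary URI: discard the buffer only."""
        del self.buffers[temp_uri]

    def finalize(self, temp_uri: str, sha256_hex: str) -> str:
        """Final metadata: verify the digest, move the data to the
        target URI, and delete the temporary URI."""
        target_uri, buf = self.buffers[temp_uri]
        if hashlib.sha256(buf).hexdigest() != sha256_hex:
            raise ValueError("digest mismatch")
        self.resources[target_uri] = bytes(buf)
        del self.buffers[temp_uri]
        return target_uri
```

In this model DELETE only ever applies to the buffer, so cancelling an
upload cannot be confused with deleting the resource being uploaded.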

The temporary URI is the token -- there is no need for a separate identifier,
unless you want to recover from missed responses (i.e., be able to repeat
the same request multiple times and let the server decide when it was
already done, for which a general request-id would be more appropriate).

Furthermore, the above can be generalized to more useful cases where
very large uploads are needed in practice. All of the ones that I have seen
deployed for real reasons have been to solve load/scale/speed problems
elsewhere in a chain of intermediaries, not just to send a very large file
to an HTTP origin server (which the vast majority of servers can handle
just fine with HTTP/1.1 over TCP).

For example, sending terabytes of data to S3 in parallel uploads to
multiple services that are then reassembled within AWS. This requires a
design where the user agent requests instruction on how/where to upload
each part in parallel and the server reconstitutes the data upon receiving
finalization of every part. IOW, the initial method would be sent with Expect
and a field indicating how large the upload will be, resulting in a 1xx/3xx
list of temporary target URIs (or URI templates) selected by the server,
potentially on different origins, where each indicated range can be
resumably uploaded in parallel and then finalized.
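
A minimal sketch of that generalization, assuming the server splits the
announced size into fixed-size ranges and hands back one temporary URI per
range. The function names, the part-size policy, the round-robin choice of
origin, and the URI layout are assumptions for illustration only; they are
not taken from S3 or from the draft.

```python
def plan_parts(total_size: int, part_size: int, origins: list[str]) -> list[dict]:
    """Split [0, total_size) into ranges and assign each a temporary URI.

    Each part covers bytes start (inclusive) to end (exclusive).
    """
    parts = []
    for index, start in enumerate(range(0, total_size, part_size)):
        end = min(start + part_size, total_size)
        origin = origins[index % len(origins)]  # round-robin, illustrative
        parts.append({"uri": f"{origin}/tmp/part-{index}",
                      "start": start, "end": end})
    return parts


def reassemble(parts: list[dict], uploaded: dict[str, bytes]) -> bytes:
    """Reconstitute the data once every part has been finalized."""
    missing = [p["uri"] for p in parts if p["uri"] not in uploaded]
    if missing:
        raise ValueError(f"parts not finalized: {missing}")
    ordered = sorted(parts, key=lambda p: p["start"])
    return b"".join(uploaded[p["uri"]] for p in ordered)
```

Each part URI can then be treated like the single temporary URI in the
smaller case: continued, cancelled, or finalized on its own.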

Note that, if you stick with HTTP semantics and URIs as identifiers, the
complex use case is just a generalization of the smaller case.

Cheers,

....Roy
Received on Friday, 1 April 2022 17:30:47 UTC