Re: Review of draft-toomim-httpbis-versions-00

Martin, thank you for engaging! It's valuable getting input on data 
synchronization from the author of Designing Data-Intensive Applications 
<https://medium.com/javarevisited/review-is-designing-data-intensive-applications-by-martin-kleppman-worth-it-b3b7dfa17a5c>.

I believe we are addressing complementary aspects of the state 
synchronization problem, rather than "different approaches." 
Specifically: you're focusing on how peers reconcile and route an E2EE 
history of blobs, whereas I'm working on the structure within those blobs.

Our work could integrate as follows:

    Envelope            1: Your work (routing protocol)
       Routing Headers
       Message Type
       Message           2: My work (message structure)
         Headers
           Version       3: This versioning spec
           Parents       3: This versioning spec
           Range
         Body
           New Value

An Envelope is a structured container within a protocol that 
encapsulates opaque data across multiple hops. SMTP, for instance, uses 
an envelope to let mail pass through relays while the contents stay 
GPG-encrypted, and the envelope is also how BCC works. My understanding 
is that you are working on a routing & reconciliation protocol for 
envelopes that maintain content opacity.

In my view, your work could be expressed as an extension to HTTP that 
allows HTTP messages to route through multiple peers. Some HTTP 
extensions already implement this concept implicitly. For example, 
Oblivious HTTP (from the OHAI working group) uses an encapsulated 
request that functions like an
envelope — it includes routing information (gateway and target URLs), 
encrypts contents, and has authenticity elements (like a nonce). This 
enables OHAI to separate client identity from request content, 
facilitating privacy-preserving routing.
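
As a rough sketch (relay hostname and path illustrative, and eliding 
the binary HPKE encapsulation), the outer hop of an Oblivious HTTP 
request is an ordinary POST whose body is an opaque envelope that only 
the gateway can open:

    POST /relay
    Host: relay.example
    Content-Type: message/ohttp-req

    <HPKE-encapsulated request for the target, readable only by the gateway>

The relay forwards this blob without being able to read it, which is 
exactly the envelope pattern described above.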

The versioning spec discussed in this thread, on the other hand, 
addresses part 3 of the diagram above: event versioning, which falls 
within part 2: the message structure.

Our efforts complement each other. Any synchronization protocol requires 
both message routing and message contents. However, I've noticed you use 
"synchronization protocol" to refer to just the reconciliation part:

    > I prefer to think about the sync protocol as a reconciliation
    > between two sets of blobs, where the two peers are figuring out
    > which blobs they need to send to each other so that, at the end,
    > they both have all the blobs.

I propose using "synchronization" to encompass the entire problem, 
including the CRDT bits. CRDTs are crucial to synchronization as they 
define how peers merge parallel edits in synchrony.

I hope we can discuss both aspects. I would love to hear your thoughts 
on the versioning proposal, and I will now address your broader 
questions about P2P routing of HTTP messages:

Martin Kleppmann wrote:

> - If you want to atomically update multiple objects, would your 
> approach require multiple PUTs? Is there a risk of some of them being 
> applied and some being dropped if the network drops at an inopportune 
> moment? In our approach we simply encode multiple updates into a 
> single blob. An example for wanting atomicity: say you want to attach 
> a comment as an annotation to a span of text. In Automerge you would 
> do that by attaching a comment ID as a mark to a range of characters, 
> and then in a separate object you would map the comment ID to a JSON 
> object with the details of the comment (author, text, timestamp, reply 
> thread, etc).


Yes, we want to support atomic mutations (e.g. "transactions") across 
objects.

I'd like to pick a more difficult example, though, because annotations 
to spans of text do not intrinsically require two atomic writes to 
create. The annotation's "attachment" can just be a single field that 
points to a span of text at a version ID, such as {version: ["x72h"], 
range: "text 44:70"}. Then you don't need an intermediate object, don't 
mutate the text CRDT, and end up with just one object to mutate.
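
For illustration only (the resource path, version ID, and body fields 
here are hypothetical, written in the style of the PUTs below), 
creating such a comment could then be a single PUT:

    PUT /comments/c1
    Version: "comment-81fr"
    Content-Type: application/json
    Content-Length: 96

    {"author": "alice", "text": "Nice!", "attachment": {"version": ["x72h"], "range": "text 44:70"}}

The attachment field pins the comment to range 44:70 of the text as it 
existed at version "x72h", so no second write to the text resource is 
needed.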

Perhaps a stronger example is a bank-account transfer. Suppose Bob wants 
to send $10 to Alice. We will debit $10 from Bob's account, and credit 
$10 to Alice's. Bob and Alice sign their mutations from different 
computers, and send them over the network:

    PUT /bob
    Version: "transaction-9bx38"
    Content-Range: json .account.balance
    Content-Length: 3

    110


    PUT /alice
    Version: "transaction-9bx38"
    Content-Range: json .account.balance
    Content-Length: 2

    90

You propose enforcing atomicity by encoding both PUT messages within the 
same opaque envelope, which would eliminate scenarios where some peers 
have one message, but not the other. Unfortunately, this requires both 
mutations to be created atomically on the same computer. If Alice and 
Bob sign and send their transactions from separate peers, they will have 
separate envelopes, and we still have an atomicity problem to contend with.

A more general way to address atomicity is via Versioning + Validation. 
Atomicity is about time, Versioning specifies time, and Validation can 
mark a version invalid until all parts of the transaction are available. 
In our case, Bob and Alice:

  * Choose a single Version ID (e.g. "transaction-9bx38") for both PUTs,
    to say that they happened atomically, at the same time.
  * Implement a validation rule (aka "precondition" in CRDT parlance)
    that says the mutation is not valid/enabled until both sides of the
    transaction have been received, and are signed by the appropriate
    parties.

I believe Validated Versioning provides a more expressive framework for 
atomicity than Multi-Message Envelopes. We can do this with PUTs, if we 
extend them with a versioning spec (e.g. in this thread) and a 
validation spec (TBD).

> - How would the HTTP-style requests map to a p2p setting? The PUT … 
> syntax seems to suggest an asymmetric, client-server style 
> relationship between the communicating nodes. I know you said that 
> Braid was p2p-compatible, but maybe the HTTP-style syntax just puts me 
> so much in a client-server mindset that it's not obvious to me how it 
> translates to communication between peers.


This might be easier to understand visually, so I just recorded this video:

    https://braid.org/protocol/visualizing-http2p

I hope that's helpful. It was my first time trying that. I'm happy to 
clarify anything that I hand-waved. The resources I used are here:

    1. https://braid.org/antimatter#viz2
    2. https://braid.org/antimatter#messages-summary

> Why prescribe the HTTP-style message format and not just let each CRDT 
> implementation define its own serialisation format that is optimised 
> for its features?

The goal is interoperability. You cannot get decentralization without 
interoperability. If you build a decentralized protocol that doesn't 
interoperate, you just create a new walled garden on top of your "p2p" 
protocol. Look at IPFS.

My work makes CRDTs and OT interoperable. We now have a common protocol 
that any CRDT or OT algorithm can use while independently optimizing its 
own features. (Yes, this is the Time Machine architecture that unifies 
OT and CRDT, which I am writing up, and a draft of which your new paper 
cites.) Part of this is a general representation of time, specified in 
terms of Events and Versions, with a "version-type" that enables 
optimizations without coupling implementations to each other's data 
structures. This versioning idea is in the spec for this thread, and is 
awaiting peer review from experts like you.

This common protocol for any CRDT or OT algorithm has many benefits:

  1. It allows us to build CRDT algorithms that support multiple
     merge-types. (See Seph's reference-crdts work.)
  2. It allows implementations to implement optimizations independently,
     while still guaranteeing consistency. (Consider EG-Walker: each
     peer can implement its walker and internal format differently.)
  3. It allows implementations to summarize or prune some ranges of
     history independently, while still guaranteeing full consistency
     for merges through other ranges of time (like with antimatter).
  4. It allows implementations to request various ranges of history
     from one another in a standard way if they have dropped information
     that they want back.
  5. It allows these operations to be implemented by generic
     infrastructure, such as CDNs, caches, and backup devices, without
     requiring them to implement any specific CRDT or OT algorithm.
  6. It allows us to build debugging and programming tools that can
     inspect and support this history without knowing about a particular
     CRDT or OT algorithm. See the braid-chrome
     <https://github.com/braid-org/braid-chrome> devtools panel as an
     example.

The goal is interoperability. It results in better performance, tools, 
and infrastructure, along with more widespread usage. This gets even 
better when we interoperate with HTTP.

> I guess one thing that your approach supports is that when in 
> unencrypted mode, the server could generate a snapshot of the document 
> rather than serving a log of the edit history. However, our blob 
> approach allows that too: a server that is able to interpret the 
> contents of the blobs can also compress them into a snapshot and serve 
> it when required (we sometimes call this a "shallow clone" by analogy 
> to the similar concept in Git). But that is an optional extension to 
> the protocol; the core protocol can still work on uninterpreted blobs.


Yes, there are important use-cases for both needs. However, one man's 
"core" is another man's "optional." May I propose the neutral principle 
of *separation of concerns*? The 
serialization/envelope/routing/reconciliation can be a separate concern 
from the message formatting. We don't need to agree on which concern is 
more "core." It's up to implementations to choose which specs they want 
to implement.

Thank you, again. This discussion has been quite valuable to me. I hope 
you find value in it, as well!

Michael
