- From: Michael Toomim <toomim@gmail.com>
- Date: Thu, 10 Oct 2024 16:37:47 -0700
- To: Martin Kleppmann <martin@kleppmann.com>
- Cc: HTTP Working Group <ietf-http-wg@w3.org>, Braid <braid-http@googlegroups.com>, Peter van Hardenberg <pvh@pvh.ca>
- Message-ID: <a6ce578c-a08a-4abd-8c9a-f0e000bbe3d0@gmail.com>
Martin, thank you for engaging! It's valuable getting input on data synchronization from the author of Designing Data-Intensive Applications <https://medium.com/javarevisited/review-is-designing-data-intensive-applications-by-martin-kleppman-worth-it-b3b7dfa17a5c>.

I believe we are addressing complementary aspects of the state synchronization problem, rather than "different approaches." Specifically: you are focusing on how peers reconcile and route an E2EE history of blobs, whereas I am working on the structure within those blobs. Our work could integrate as follows:

    Envelope                        1: Your work (routing protocol)
        Routing Headers
        Message Type
        Message                     2: My work (message structure)
            Headers
                Version             3: This versioning spec
                Parents             3: This versioning spec
                Range
            Body
                New Value

An Envelope is a structured container within a protocol that encapsulates opaque data across multiple hops. It's a feature of SMTP, for instance, that allows mail to pass through relays while the contents stay GPG-encrypted, and it is also how BCC works. In my understanding, you are working on a routing & reconciliation protocol for envelopes that maintain content opacity. In my view, your work could be expressed as an extension to HTTP that allows HTTP messages to route through multiple peers.

Some HTTP extensions already implement this concept implicitly. For example, OHAI (Oblivious HTTP) uses an encapsulated request that functions like an envelope: it includes routing information (gateway and target URLs), encrypts the contents, and carries authenticity elements (like a nonce). This lets OHAI separate client identity from request content, enabling privacy-preserving routing.

The versioning spec discussed in this thread, on the other hand, addresses part 3 of the diagram above: event Versioning, which falls within part 2: Messages.

Our efforts complement each other. Any synchronization protocol requires both message routing and message contents. However, I've noticed you use "synchronization protocol" to refer to just the reconciliation part:

> I prefer to think about the sync protocol as a reconciliation between
> two sets of blobs, where the two peers are figuring out which blobs
> they need to send to each other so that, at the end, they both have
> all the blobs.

I propose using "synchronization" to encompass the entire problem, including the CRDT bits. CRDTs are crucial to synchronization: they define how peers merge parallel edits into a consistent state. I hope we can discuss both aspects. I would love to hear your thoughts on the versioning proposal, and I will now address your broader questions about P2P routing of HTTP messages.

Martin Kleppmann wrote:

> - If you want to atomically update multiple objects, would your
> approach require multiple PUTs? Is there a risk of some of them being
> applied and some being dropped if the network drops at an inopportune
> moment? In our approach we simply encode multiple updates into a
> single blob. An example for wanting atomicity: say you want to attach
> a comment as an annotation to a span of text. In Automerge you would
> do that by attaching a comment ID as a mark to a range of characters,
> and then in a separate object you would map the comment ID to a JSON
> object with the details of the comment (author, text, timestamp, reply
> thread, etc).

Yes, we want to support atomic mutations (e.g. "transactions") across objects. I'd like to pick a more difficult example, though, because annotations to spans of text do not intrinsically require two atomic writes to create. The annotation's "attachment" can just be a single field that points to a span of text at a version ID, such as {version: ["x72h"], range: "text 44:70"}. Then you don't need an intermediate object, you don't mutate the text CRDT, and you end up with just one object to mutate.
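To make that concrete, here is a hypothetical sketch of creating such a comment with a single PUT, using a Version header in the style of this versioning spec. The resource path, the JSON field names, and the "text 44:70" range syntax are illustrative assumptions on my part, not something the spec defines:

    PUT /comment/9bf2
    Version: "comment-9bf2-v0"
    Content-Type: application/json

    {"author": "alice",
     "text": "Can we rephrase this sentence?",
     "attachment": {"version": ["x72h"], "range": "text 44:70"}}

One write, one object, and the text CRDT itself is untouched.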
Perhaps a stronger example is a bank account transfer. Suppose Bob wants to send $10 to Alice. We will debit $10 from Bob's account, and credit $10 to Alice's. Bob and Alice sign their mutations from different computers, and send them over the network:

    PUT /bob
    Version: "transaction-9bx38"
    Content-Range: json .account.balance
    Content-Length: 3

    110

    PUT /alice
    Version: "transaction-9bx38"
    Content-Range: json .account.balance
    Content-Length: 2

    90

You propose enforcing atomicity by encoding both PUT messages within the same opaque envelope, which would eliminate scenarios where some peers have one message but not the other. Unfortunately, this requires both mutations to be created atomically on the same computer. If Alice and Bob sign and send their transactions from separate peers, they will have separate envelopes, and we still have an atomicity problem to contend with.

A more general way to address atomicity is via Versioning + Validation. Atomicity is about time, Versioning specifies time, and Validation can mark a version invalid until all parts of the transaction are available. In our case, Bob and Alice:

* Choose a single Version ID (e.g. "transaction-9bx38") for both PUTs, to say that they happened atomically, at the same time.

* Implement a validation rule (aka "precondition" in CRDT parlance) that says the mutation is not valid/enabled until both sides of the transaction have been received, and are signed by the appropriate parties.

I believe Validated Versioning provides a more expressive framework for atomicity than Multi-Message Envelopes. We can do this with PUTs, if we extend them with a versioning spec (e.g. the one in this thread) and a validation spec (TBD).
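To illustrate the shape such a validation rule might take, here is a minimal sketch in TypeScript. Everything in it (the SignedUpdate type, its field names, and the assumption that signature verification happens upstream) is an assumption for illustration; neither this versioning spec nor the (TBD) validation spec defines any of it:

    // Hypothetical shape of a received, signed PUT. These field names are
    // assumptions for this sketch, not part of any spec in this thread.
    type SignedUpdate = {
      resource: string;         // e.g. "/bob" or "/alice"
      version: string;          // e.g. "transaction-9bx38"
      body: string;             // e.g. "110"
      signer: string;           // who signed this mutation
      signatureValid: boolean;  // assume verification happened upstream
    };

    // The precondition: a transaction version is not enabled until *both*
    // legs have arrived and each is signed by the appropriate party.
    function transactionEnabled(updates: SignedUpdate[], versionId: string): boolean {
      const legs = updates.filter(u => u.version === versionId && u.signatureValid);
      const bobLeg = legs.some(u => u.resource === "/bob" && u.signer === "bob");
      const aliceLeg = legs.some(u => u.resource === "/alice" && u.signer === "alice");
      return bobLeg && aliceLeg;
    }

    // A peer buffers updates for "transaction-9bx38" and applies them only
    // once transactionEnabled(buffer, "transaction-9bx38") returns true.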
> - How would the HTTP-style requests map to a p2p setting? The PUT …
> syntax seems to suggest an asymmetric, client-server style
> relationship between the communicating nodes. I know you said that
> Braid was p2p-compatible, but maybe the HTTP-style syntax just puts me
> so much in a client-server mindset that it's not obvious to me how it
> translates to communication between peers.

This might be easier to understand visually, so I just recorded this video:

    https://braid.org/protocol/visualizing-http2p

I hope that's helpful. It was my first time trying that, and I'm happy to clarify anything that I hand-waved. The resources I used are here:

1. https://braid.org/antimatter#viz2
2. https://braid.org/antimatter#messages-summary

> Why prescribe the HTTP-style message format and not just let each CRDT
> implementation define its own serialisation format that is optimised
> for its features?

The goal is interoperability. You cannot get decentralization without interoperability. If you build a decentralized protocol that doesn't interoperate, you just create a new walled garden on top of your "p2p" protocol. Look at IPFS.

My work makes CRDTs and OT interoperable. We now have a common protocol that any CRDT and OT algorithm can use, while independently optimizing their own features. (Yes, this is the Time Machine architecture that unifies OT and CRDT, which I am writing up, and which your new paper cites a draft of.) Part of this is a general representation of time, specified in terms of Events and Versions, with a "version-type" that enables optimizations without coupling implementations to each other's data structures. This versioning idea is in the spec for this thread, and is awaiting peer review from experts like you.

This common protocol for any CRDT or OT algorithm has many benefits:

(1) It allows us to build CRDT algorithms that support multiple merge-types. (See Seph's reference-crdts work.)

(2) It allows implementations to apply optimizations independently, while still guaranteeing consistency. (Consider EG-Walker: each peer can implement a walker and its internal format differently.)

(3) It allows implementations to summarize or prune some ranges of history independently, while still guaranteeing full consistency for merges through other ranges of time (like with antimatter).

(4) It allows peers to request ranges of history from one another in a standard way if they have dropped information that they want back.

(5) It allows these operations to be implemented by generic infrastructure, such as CDNs, caches, and backup devices, without requiring them to implement any specific CRDT or OT algorithm. (A sketch of such a generic store follows this list.)

(6) We can also build debugging and programming tools that inspect and support this history without knowing about a particular CRDT or OT algorithm. See the braid-chrome <https://github.com/braid-org/braid-chrome> devtools panel as an example.
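Here is a minimal sketch of the kind of algorithm-agnostic store a cache, CDN, or backup device could keep to support points (4) and (5): updates indexed by version ID, linked by their parents, with bodies treated as opaque bytes. The data shapes and method names are assumptions for illustration only; nothing here is defined by the versioning spec:

    // Hypothetical shape of one stored update. The body stays opaque to the
    // cache; only a CRDT/OT peer ever interprets it.
    type StoredUpdate = {
      version: string;    // e.g. "transaction-9bx38"
      parents: string[];  // version IDs this update builds upon
      body: Uint8Array;   // opaque bytes
    };

    // A generic store can answer "send me everything I'm missing since these
    // versions" by walking the version graph, with no CRDT/OT knowledge.
    class VersionStore {
      private updates = new Map<string, StoredUpdate>();

      put(u: StoredUpdate): void {
        this.updates.set(u.version, u);
      }

      // Return every stored update that is not an ancestor of the given versions.
      missingSince(known: string[]): StoredUpdate[] {
        const ancestors = new Set<string>();
        const stack = [...known];
        while (stack.length > 0) {
          const v = stack.pop()!;
          if (ancestors.has(v)) continue;
          ancestors.add(v);
          const u = this.updates.get(v);
          if (u) stack.push(...u.parents);
        }
        return [...this.updates.values()].filter(u => !ancestors.has(u.version));
      }
    }

A real cache would also need pruning and summarizing policies, as in point (3), but the point stands: none of this requires interpreting the bodies.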
The goal is interoperability. It results in better performance, tools, and infrastructure, along with more widespread usage. This gets even better when we interoperate with HTTP.

> I guess one thing that your approach supports is that when in
> unencrypted mode, the server could generate a snapshot of the document
> rather than serving a log of the edit history. However, our blob
> approach allows that too: a server that is able to interpret the
> contents of the blobs can also compress them into a snapshot and serve
> it when required (we sometimes call this a "shallow clone" by analogy
> to the similar concept in Git). But that is an optional extension to
> the protocol; the core protocol can still work on uninterpreted blobs.

Yes, there are important use cases for both. However, one man's "core" is another man's "optional." May I propose the neutral principle of *separation of concerns*? The serialization/envelope/routing/reconciliation can be a separate concern from the message formatting. We don't need to agree on which concern is more "core"; it's up to implementations to choose which specs they want to implement.

Thank you, again. This discussion has been quite valuable to me. I hope you find value in it, as well!

Michael

Received on Thursday, 10 October 2024 23:37:55 UTC