- From: Tom Bergan <tombergan@chromium.org>
- Date: Mon, 23 Jan 2017 16:39:42 -0800
- To: HTTP Working Group <ietf-http-wg@w3.org>
- Message-ID: <CA+3+x5Ft8MfRduWp1RzJ9_qAgQiCujac8f5FLz81Xpyu3cTDxw@mail.gmail.com>
I implemented the HTTP/2 response scheduler in Go's HTTP/2 server library. I'm trying to understand the worst-case behavior of that scheduler. I believe the worst-case behavior is necessarily O(n) operations per frame sent on the wire, where n is the number of streams in the dependency tree. Here is my rationale. The key operation is finding the highest-priority stream that is ready to send data. If we don't care about weights, and we don't care about balancing bandwidth usage across sibling nodes in a tree, then we can label each node with two fields: "ready" (true if the stream is ready to send data) and "depth" (the node's distance from the root of the tree). The scheduler must find a node with the smallest depth over all nodes with ready = true. It is fairly trivial to implement this in O(log n). Now, let's introduce weights. The scheduler must allocate bandwidth to all ready nodes, which happens recursively as follows: func allocateBandwidth(node, bw) { if (node.ready) { node.bandwidthShare = bw return } totalWeight = 0 for (n in node.children) { if (n.ready || descendantIsReady(n)) { totalWeight += n.weight } } for (n in node.children) { if (n.ready || descendantIsReady(n)) { allocateBandwidth(n, bw * n.weight / totalWeight) } } } allocateBandwidth(root, 1.0) I believe the above definition is a direct translation of RFC 7540 Section 5.3.2 (also see this thread <https://lists.w3.org/Archives/Public/ietf-http-wg/2016OctDec/0601.html>, which discussed bandwidth allocation). Given a complete bandwidth allocation, the server can translate each node's bandwidthShare to a number of tokens that are updated using a token bucket algorithm (or similar). The scheduler would then pick the node with the most available tokens. The scheduler looks something like: func scheduleNextFrame() { if (tree changed since last frame written) { allocateBandwidth(root, 1.0) assign tokens and build a priority queue containing all nodes with allocated bandwidth } node = priorityQueue.head() node.consumeTokens() // consume up to frame size or flow-control limit priorityQueue.update(node) return node } There are two steps in scheduleNextFrame. The first step updates the bandwidth allocation if the tree changed since the last frame was written. This is the most expensive step. I don't believe it's possible to implement allocateBandwidth using fewer than O(n) worst-case steps. For example, if the tree is mostly flat, meaning that most nodes are children of the same node, then the loop to compute totalWeight is O(n). I believe Firefox <http://bitsup.blogspot.com/2015/01/http2-dependency-priorities-in-firefox.html> creates mostly-flat trees. Further, allocateBandwidth makes O(depth) recursive steps, where "depth" is the maximum depth of the dependency tree. If the tree is mostly-linear, then O(depth) becomes O(n). Chrome <https://cs.chromium.org/chromium/src/net/spdy/http2_priority_dependencies.cc> creates mostly-linear trees. The overall runtime depends on how often the tree changes. If the tree changes rarely, the scheduler is cheap. If the tree changes frequently, the scheduler is worst-case O(n). A tree has "changed" if it changed shape (nodes are added/moved/removed) or if any node's ready state has changed. Both kinds of changes can happen often in practice, suggesting that the overall scheduler is worst-case O(n). For example, consider a server that sends small responses (one or two DATA frames per response) -- each stream will be closed after one or two DATA frames, so on average, the tree will change shape every few frames. Further, it's possible for ready states to change more frequently than you might think. In Go's HTTP/2 implementation, a stream does not become ready until the application handler calls Write to buffer response data. After that buffered data is serialized on the wire, the stream transitions to "not ready" because the scheduler cannot know when the next Write call will happen. The stream will transition back to "ready" during the next Write call. Each Write call typically buffers about 32KB of data. This is a "push" model, where the application handler "pushes" data into the scheduler. I'm aware of one other HTTP/2 server that works similarly. I suspect that frequent ready/not-ready transitions are common to most HTTP/2 servers that use a "push" model. These servers will be more susceptible to the O(n) worst case. Questions: 1. Am I missing a clever implementation, or is it true that a faithful HTTP/2 scheduler necessarily requires O(n) operations per frame sent on the wire, in the worst case? I could not find much discussion of this question after a quick search. H2O claims to implement an O(1) scheduler <https://h2o.examp1e.net/configure/http2_directives.html>, however, the code seems to be worst-case O(depth) or O(n) -- see here <https://github.com/h2o/h2o/blob/9b70b61b0705ccce22a188e65363cdb506285ff1/lib/http2/scheduler.c#L326> , here <https://github.com/h2o/h2o/blob/9b70b61b0705ccce22a188e65363cdb506285ff1/lib/http2/scheduler.c#L342>, and here <https://github.com/h2o/h2o/blob/9b70b61b0705ccce22a188e65363cdb506285ff1/lib/http2/scheduler.c#L214> . 2. If the above is correct, should I be concerned about the O(n) worst case? I doubt that a typical web browsing session will trigger O(n) behavior frequently, so I'm less concerned about the average case; I'm more concerned about pathological cases or possible DoS vectors. Also, think about cases where the "client" is actually a proxy server, meaning the HTTP/2 connection may have many more concurrent streams than a typical browsing session. For comparison, if you recall the predecessor to HTTP/2 (SPDY), a SPDY scheduler could be trivially implemented in O(1), since SPDY used just eight priority buckets. 3. If I should be concerned about an O(n) worst case, are there any suggested mitigations beyond setting SETTINGS_MAX_CONCURRENT_STREAMS to a smallish constant?
Received on Tuesday, 24 January 2017 00:40:18 UTC