Is a faithful HTTP/2 response scheduler necessarily O(n) in the worst case?

I implemented the HTTP/2 response scheduler in Go's HTTP/2 server library.
I'm trying to understand the worst-case behavior of that scheduler. I
believe the worst-case behavior is necessarily O(n) operations per frame
sent on the wire, where n is the number of streams in the dependency tree.
Here is my rationale.

The key operation is finding the highest-priority stream that is ready to
send data.

If we don't care about weights, and we don't care about balancing bandwidth
usage across sibling nodes in a tree, then we can label each node with two
fields: "ready" (true if the stream is ready to send data) and "depth" (the
node's distance from the root of the tree). The scheduler must find a node
with the smallest depth over all nodes with ready = true. It is fairly
trivial to implement this in O(log n).
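
As a rough sketch of that O(log n) lookup (illustrative names only, not
the Go library's actual data structures), a min-heap of ready streams
keyed by depth would suffice:

  import "container/heap"

  type stream struct {
    id    uint32
    depth int // distance from the root of the dependency tree
  }

  // readyHeap is a min-heap of ready streams ordered by depth, so the
  // shallowest ready stream is always at the root of the heap.
  type readyHeap []*stream

  func (h readyHeap) Len() int            { return len(h) }
  func (h readyHeap) Less(i, j int) bool  { return h[i].depth < h[j].depth }
  func (h readyHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
  func (h *readyHeap) Push(x interface{}) { *h = append(*h, x.(*stream)) }
  func (h *readyHeap) Pop() interface{} {
    old := *h
    s := old[len(old)-1]
    *h = old[:len(old)-1]
    return s
  }

  // nextReady removes and returns the shallowest ready stream in O(log n).
  func nextReady(h *readyHeap) *stream {
    if h.Len() == 0 {
      return nil
    }
    return heap.Pop(h).(*stream)
  }

Streams would be added with heap.Push as they become ready and
re-positioned with heap.Fix when their depth changes.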

Now, let's introduce weights. The scheduler must allocate bandwidth to all
ready nodes, which happens recursively as follows:

  type node struct {
    ready          bool    // stream has data ready to send
    weight         float64 // HTTP/2 priority weight
    children       []*node
    bandwidthShare float64
  }

  // descendantIsReady reports whether any stream in n's subtree is ready.
  func descendantIsReady(n *node) bool {
    for _, c := range n.children {
      if c.ready || descendantIsReady(c) {
        return true
      }
    }
    return false
  }

  func allocateBandwidth(n *node, bw float64) {
    if n.ready {
      n.bandwidthShare = bw
      return
    }
    totalWeight := 0.0
    for _, c := range n.children {
      if c.ready || descendantIsReady(c) {
        totalWeight += c.weight
      }
    }
    for _, c := range n.children {
      if c.ready || descendantIsReady(c) {
        allocateBandwidth(c, bw*c.weight/totalWeight)
      }
    }
  }

  allocateBandwidth(root, 1.0)

I believe the above definition is a direct translation of RFC 7540 Section
5.3.2 (also see this thread
<https://lists.w3.org/Archives/Public/ietf-http-wg/2016OctDec/0601.html>,
which discussed bandwidth allocation). Given a complete bandwidth
allocation, the server can translate each node's bandwidthShare to a number
of tokens that are updated using a token bucket algorithm (or similar). The
scheduler would then pick the node with the most available tokens. The
scheduler looks something like:

  func scheduleNextFrame() {
    if (tree changed since last frame written) {
      allocateBandwidth(root, 1.0)
      assign tokens and build a priority queue containing all nodes
          with allocated bandwidth
    }
    node = priorityQueue.head()
    node.consumeTokens() // consume up to frame size or flow-control limit
    priorityQueue.update(node)
    return node
  }

There are two steps in scheduleNextFrame. The first step updates the
bandwidth allocation if the tree changed since the last frame was written.
This is the most expensive step. I don't believe it's possible to implement
allocateBandwidth using fewer than O(n) worst-case steps. For example, if
the tree is mostly flat, meaning that most nodes are children of the same
node, then the loop to compute totalWeight is O(n). I believe Firefox
<http://bitsup.blogspot.com/2015/01/http2-dependency-priorities-in-firefox.html>
creates mostly-flat trees. Further, allocateBandwidth makes O(depth)
recursive steps, where "depth" is the maximum depth of the dependency tree.
If the tree is mostly-linear, then O(depth) becomes O(n). Chrome
<https://cs.chromium.org/chromium/src/net/spdy/http2_priority_dependencies.cc>
creates mostly-linear trees.
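
To make those two shapes concrete, here is a small sketch (reusing the
node type from the allocateBandwidth sketch above; the constructions are
illustrative, not Firefox's or Chrome's actual trees):

  // flatTree builds n ready streams that are all children of the root,
  // roughly the Firefox shape. allocateBandwidth(root, 1.0) iterates
  // over all n children just to compute totalWeight.
  func flatTree(n int) *node {
    root := &node{}
    for i := 0; i < n; i++ {
      root.children = append(root.children, &node{ready: true, weight: 16})
    }
    return root
  }

  // linearTree builds a chain of n streams where only the deepest is
  // ready, roughly the Chrome shape. allocateBandwidth(root, 1.0) must
  // recurse through all n nodes to reach the ready leaf.
  func linearTree(n int) *node {
    root := &node{}
    cur := root
    for i := 0; i < n; i++ {
      child := &node{weight: 16}
      cur.children = []*node{child}
      cur = child
    }
    cur.ready = true
    return root
  }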

The overall runtime depends on how often the tree changes. If the tree
changes rarely, the scheduler is cheap. If the tree changes frequently, the
scheduler is worst-case O(n). A tree has "changed" if it changed shape
(nodes are added/moved/removed) or if any node's ready state has changed.
Both kinds of changes can happen often in practice, suggesting that the
overall scheduler is worst-case O(n). For example, consider a server that
sends small responses (one or two DATA frames per response) -- each stream
will be closed after one or two DATA frames, so on average, the tree will
change shape every few frames. Further, it's possible for ready states to
change more frequently than you might think. In Go's HTTP/2 implementation,
a stream does not become ready until the application handler calls Write to
buffer response data. After that buffered data is serialized on the wire,
the stream transitions to "not ready" because the scheduler cannot know
when the next Write call will happen. The stream will transition back to
"ready" during the next Write call. Each Write call typically buffers about
32KB of data. This is a "push" model, where the application handler
"pushes" data into the scheduler. I'm aware of one other HTTP/2 server that
works similarly. I suspect that frequent ready/not-ready transitions are
common to most HTTP/2 servers that use a "push" model. These servers will
be more susceptible to the O(n) worst case.
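
To illustrate the push model described above (this is a simplified
sketch with made-up names, not the actual code in Go's net/http2
package):

  type bufStream struct {
    ready    bool
    buffered int // bytes buffered by the handler but not yet framed
  }

  type pushScheduler struct {
    treeChanged bool // forces allocateBandwidth before the next frame
  }

  // onHandlerWrite is called when the application handler buffers
  // response data (typically around 32KB per Write call).
  func (sc *pushScheduler) onHandlerWrite(s *bufStream, n int) {
    s.buffered += n
    s.ready = true
    sc.treeChanged = true
  }

  // onFrameWritten is called after a DATA frame of n bytes is serialized
  // on the wire. Once the buffer drains, the scheduler cannot know when
  // the next Write will arrive, so the stream drops back to not-ready.
  func (sc *pushScheduler) onFrameWritten(s *bufStream, n int) {
    s.buffered -= n
    if s.buffered <= 0 {
      s.buffered = 0
      s.ready = false
      sc.treeChanged = true
    }
  }

Each such transition counts as a tree change, which is why a push-model
server can hit the O(n) re-allocation often.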

Questions:

1. Am I missing a clever implementation, or is it true that a faithful
HTTP/2 scheduler necessarily requires O(n) operations per frame sent on the
wire, in the worst case? I could not find much discussion of this question
after a quick search. H2O claims to implement an O(1) scheduler
<https://h2o.examp1e.net/configure/http2_directives.html>; however, the
code seems to be worst-case O(depth) or O(n) -- see here
<https://github.com/h2o/h2o/blob/9b70b61b0705ccce22a188e65363cdb506285ff1/lib/http2/scheduler.c#L326>,
here
<https://github.com/h2o/h2o/blob/9b70b61b0705ccce22a188e65363cdb506285ff1/lib/http2/scheduler.c#L342>,
and here
<https://github.com/h2o/h2o/blob/9b70b61b0705ccce22a188e65363cdb506285ff1/lib/http2/scheduler.c#L214>.

2. If the above is correct, should I be concerned about the O(n) worst
case? I doubt that a typical web browsing session will trigger O(n)
behavior frequently, so I'm less concerned about the average case than
about pathological cases and possible DoS vectors. Also consider cases
where the "client" is actually a proxy server, meaning the HTTP/2
connection may have many more concurrent streams than a typical browsing
session. For comparison, a scheduler for SPDY (HTTP/2's predecessor)
could be implemented trivially in O(1), since SPDY used just eight
priority buckets; a sketch of such a bucket scheduler appears below,
after question 3.

3. If I should be concerned about an O(n) worst case, are there any
suggested mitigations beyond setting SETTINGS_MAX_CONCURRENT_STREAMS to a
smallish constant?
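
For comparison with question 2, here is the kind of constant-time bucket
scheduler that was sufficient for SPDY's eight fixed priority levels (an
illustrative sketch, not any particular server's code):

  // spdyScheduler keeps one FIFO queue of ready streams per SPDY
  // priority level. With a fixed number of levels, picking the next
  // stream is O(1).
  type spdyScheduler struct {
    buckets [8][]uint32 // ready stream IDs, indexed by priority (0 = highest)
  }

  func (s *spdyScheduler) add(streamID uint32, pri int) {
    s.buckets[pri] = append(s.buckets[pri], streamID)
  }

  // next pops the first stream from the highest-priority non-empty
  // bucket. The scan covers at most 8 buckets, a constant.
  func (s *spdyScheduler) next() (uint32, bool) {
    for pri := range s.buckets {
      if len(s.buckets[pri]) > 0 {
        id := s.buckets[pri][0]
        s.buckets[pri] = s.buckets[pri][1:]
        return id, true
      }
    }
    return 0, false
  }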

Received on Tuesday, 24 January 2017 00:40:18 UTC