Re: Submitted new I-D: Cache Digests for HTTP/2 from Kazuho Oku on 2016-01-12 (ietf-http-wg@w3.org from January to March 2016)

From: Kazuho Oku <kazuhooku@gmail.com>
Date: Tue, 12 Jan 2016 10:04:00 +0900
To: Alcides Viamontes E <alcidesv@zunzun.se>
Cc: HTTP Working Group <ietf-http-wg@w3.org>
Message-ID: <CANatvzxcKS46iAqAdfBHuWPt5k3XkR79NDMPPtDakOb2jPAywA@mail.gmail.com>
2016-01-11 2:11 GMT+09:00 Alcides Viamontes E <alcidesv@zunzun.se>:
> Hello,
>
> My interest in the draft "Cache Digests for HTTP/2"
> https://datatracker.ietf.org/doc/draft-kazuho-h2-cache-digest/
> concerns the original, intended use case that Mr. Kakuho Oku and Mr.
> M. Nottingham cited. As the authors, I would like very much like to
> see this made a standard and implemented in browsers. However, I
> perceive a few issues. Beforehand, I apologize for this long email,
> for any gaps in my understanding of the subject, and for not being
> familiar with the language and procedures used in this list.

Thank you for your feedback!

> Here are the issues that I see:
>
> 1.- In its current wording, no information about which version of a
> representation the browser already has is present in the cache digest.
> That information can be included in the URL itself (cache busting),
> but then it becomes a concern for web-developers, adds complexity to
> their work, and bypasses the mechanisms that HTTP has in place for
> maintaining cache state.  It also increases space pressure in the the
> browser's cache as the server is left with no means to expire old
> cached contents in the browser.

That is a very good point.

Let me first discuss the restrictions of the cache model used by HTTP,
and then go on to discuss what we should do if we are to fix the point
you raised.

First about the restriction; the resources in the cache can be divided
into two groups: fresh and non-fresh.  A server should never push a
resource that is considered as fresh in the client's cache.  Clients
will not notice the push / the HTTP/2 allows client to discard such
push.  Therefore, a CACHE_DIGEST frame
must include a filter that marks the resources that are marked as
being fresh.  That is what the current draft specifies.

Next about the point of including version information (e.g.
Last-Modified, ETag) in the cache digest.  I believe we can add a
second Golomb-coded set to the frame that uses hash(URI + version
information) as the key.  A server can refer to the information to
determine whether if it should push a 304 response or a 200 response.

The downside is that the CACHE_DIGEST frame may become larger (if the
server sends many responses that would become non-fresh), so it might
be sensible to allow the client to decide if it should send the second
Golomb-coded set.

In addition, we should agree on how to push 304 response.  My
understanding is that HTTP/2 spec., is vague on this, and that there
has not yet been an agreement  between the client developers on how it
should be done.

Once that is solved, I think we should update the I-D to cover the
version information as well.

> 2.- There is no way for the server to know that a CACHE_DIGEST frame
> is coming immediately after a HEADERS frame. A server may trigger some
> processing already after the end of headers has been received, while
> making  further DATA frames available as a stream of data to the
> application. With CACHE_DIGEST frames, the cache aware server will
> have to delay processing until the end of the stream has been seen to
> be sure that no CACHE_DIGEST frame is coming, or would have to
> re-start processing on seeing the frame. Arguably this is not a big
> problem for GET requests with an empty body, but it would be nice if
> the spec didn't force the server to wait for the end of the stream.

Agreed.

There are three options here (the draft adopts option C):

a) send CACHE_DIGEST frame right before HEADERS
b) send CACHE_DIGEST frame at stream_id=zero, with the value of the
authority that should be associated to the digest included within the
frame
c) send CACHE_DIGEST frame right after HEADERS

B is clearly the easiest but would have a small impact on the consumed
bandwidth, since the authority needs to be sent separately.

In A, the server does not need to delay the processing of the request,
but needs to cache the value of the digest.

It would be great to discuss which of the three approach will be the
best solution in general (or if there could be other approaches).

> 3. - Traditionally, cache state information has been placed in HTTP
> header fields. A CACHE_DIGEST frame puts some of that information in a
> new place, which is sure to cause some pain to web developers and
> sys-admins trying to understand the behavior of their applications.

CACHE_DIGEST frame should not be a HTTP header, since including the
value in every HTTP request (as a header) will make the HTTP requests
huge.  Since the client's cache state changes as the server sends
responses, we cannot expect HPACK to effectively compress the
requests.  We should send cache digest only once per HTTP/2
connection.
(note that intermediaries are allowed to re-order the HTTP requests
sent from a client, so it is impossible to include the digest only in
the first HTTP request as a header).

The other reason is that the digest should be hop-by-hop.  The default
behavior of a proxy (that do not understand the extension) should be
to drop the digest.

> 4.- The draft assumes a somewhat more restricted scope of Push than
> allowed by the HTTP/2 spec, RFC7540, and to some extent, goes against
> current practice. Section 8.2 of RFC7540, "Server Push", says "The
> server MUST include a value in the :authority" pseudo-header field for
> which the server is authoritative". Section 10.1 defines server
> authority by referring to [RFC7230], Section 9.1. For the HTTPS case,
> a server is authoritative for a domain if it can present a certificate
> that covers that domain. To the point, RFC7540 does not forbids a
> server to push resources for different domains, provided that it has
> the right credentials. Pushing assets for a domain different than the
> one where the request is received is useful when considering the way
> web applications are structured today: many serve their application
> logic using a www.example.com domain, while serving their static
> assets at static.example.com . Therefore, upon receiving a request to
> www.example.com, a server may want to push resources for
> static.example.com. However, section 2.1 of the draft works against
> that use case.

Thank you for pointing that out.

I think that for plaintext HTTP we agree that the client needs to
associate the name of the authority to the digest that it sends
(including one of the three options discussed above).

Considering the case for HTTPS, may be we should better allow the
client whether or not to associate an authority.

> 5.- A last issue has to do with what to include in the cache digest.
> Mr.  Oku proposes to only push resources which are in the critical
> render path in his article at [1]. Correspondingly, the cache digest
> would only need to include those resources. Can we have a simple
> mechanism to control the cache digest contents?

It is obvious that providing a way to specify the resources that
should be included in the cache digest will let clients generate more
compact digest values.

The downside is that it would be difficult for server administrators
to _change_ a resource to become part of the digest.  Consider the
case where a server has send resource A that is not being marked as
part of the digest, and then the server administrator then changes the
configuration for resource A to be included in part of the digest.
The client will not include A in the digest it sends, since it is not
marked.  The server will push the A to the client since it is not
included in the digest.  (As discussed above) a client may discard the
resource being pushed.  So A will continued to be pushed every time a
new request is issued.

Considering such possibility, it would be less troublesome if we could
go without introducing a way to configure what should be included in
the digest.


> I can provide some data and some rough suggestions to address the issues above.
>
> How big would a cache digest be anyway?
> -------------------------------------------------------
>
> To address issues 2 and 3 we need to determine how constrained we are
> regarding space. We have made a little study[2] across 1300 sites
> submitted by performance-conscious site operators, and from there we
> can establish that while 50% of sites fetch between 25 and 110
> resources, it is not too rare to have sites doing more than 200 HTTP
> requests. If anything, that number is going to grow. Specially with
> HTTP/2. Let's then use 200 as a ballpark estimate of the number of
> items in a cache digest and start from there.
>
> The source that the draft includes for Golomb-coded-sets (GCS) hints
> that it is possible to use the number of bits in a Bloom filter as an
> upper bound for the size of the corresponding GCS. Therefore, with a
> digest of size 200, we would be using an upper bound of 200*1.44*512
> bits, which is around 18 kB is expressed as binary, and around 24 kB
> if expressed in ascii form, base64-encoded, assuming a false positive
> probability of 1/512.

Your calculation is wrong.  A 200-entry GCS (with 1/512 false positive
rate) will be slightly larger than 225 bytes (log2(512) * 200 bits) in
binary form.

> Notice that by using PUSH the browser may skip many of those requests.
> In our site (https://www.shimmercat.com), we have measured HTTP/2
> requests averaging at 60 bytes per request. Therefore, one may end up
> saving up to 200 * 60 = 12  kB in traffic, bringing down the previous
> numbers to 18 kB -12 kB =6 kB and 24 kB - 12 kB = 12 kB. I think that
> 12 kB is acceptable for a site with 200 requests, specially since
> HTTP/2 PUSH would greatly increase the data transfer density for those
> sites.
>
>
> Can we embed the cache digest in a header?
> ------------------------------------------------------------
>
> Having 24 kB of cache digest in a header may delay processing the
> request more than acceptable, since most servers will wait to get the
> entire header block before starting to create an answer. There is an
> alternative however, and that would be to put a  field with the cache
> digest in a request trailer, allowed with chunked transfer under
> HTTP/1.1 and in all streams with HTTP/2. The pros of having the cache
> digest in a header or trailer field are the following:  we don't break
> with the tradition of exchanging cache state through headers,  headers
> are visible to developers' tools, it would be possible to test things
> using polyfills and service workers while the browsers catch up with
> native implementations, no extensions to HTTP/2 are needed, and cache
> digests would become possible even over plain old HTTP/1.1. It can
> also be made a little more future-proof:
>
> In the headers:
>
>        cache-digest: trailers
>
> (the indication above is not needed however if the cache-digest-scope
> is used, see below)
>
> In the trailers:
>
>          cache-digest: data:application/golomb-coded-set;base64,.....
>
> The cons is that ascii is bigger than binary.
>
> Even if the CACHE_DIGEST frame is pursued, it would be nice to have
>
>          cache-digest: frame
>
> as part of the request (and this time in the headers section, not the
> trailers) for the server to recognize that a cache digest frame is
> coming and for developers to have a hint that said information is
> being transmitted between client and server.
>
>
> Distinguishing representation versions in the cache digest (Addressing point 1)
>
> ---------------------------------------------------------------------------------------------------------
>
> The GCS filter requires the client and the server to be able to
> compute the same hash key for a given resource and version. As far as
> I understand, having semantics here similar to if-modified-since would
> not be possible. But strong etags could be used when computing the
> key, therefore enabling the equivalent to if-none-match. Step 4 in the
> algorithm of section 2.1 of the draft could be extended to have the
> etag used together with the URL when taking the hash.
>
>
> Which representations should be part of the digest? (Addressing point 4 and 5)
> --------------------------------------------------------------------------------------------------------
>
> I suggest to introduce the concept of cache digest scope. Only
> representations which were given a cache digest scope would be made
> part of a cache digest. And the set of representations URLs to be
> included by the client in the digest would be the intersection of:
>
> 1. The set of representations that have the same cache digest scope in
> the browser's cache than the domain of the first request (the
> document), and
> 2. The set of representations in the browser's cache for which the
> server is considered authoritative.
>
> The cache digest scope would be unique per domain.
>
> In other words, it would look like the following:
>
> Client asks for
>                   https://www.example.com/
> Server answers, and adds a header
>   cache-digest-scope: example
> The server then answers or pushes
>    https://static.example.com/styles.css ,
>         it uses the same header
>            cache-digest-scope: example.
>
> The server also answers or pushes
>    https://media.example.com/hero-1.png,
>        but no cache-digest-scope is provided.
>
>
> .... some time after, when a new connection is established by the same
> client to fetch another page from the same domain:
>
> Client asks for (a different page)
>          https://www.example.com/page1.html ,
>         now the client specifies a header
>        cache-digest-scope: example
>         client also provides a cache digest with all
>         the resources that were assigned
>         the same cache digest scope by the server.
>         That digest would include the resource from
> https://static.example.com/styles.css
>         but not the one at
>                  https://media.example.com/hero-1.png
>
> The server answers and pushes a 304 not modified for
> https://static.example.com/styles.css ,
>         or a 200 with new contents, using a cache contents
>         aware PUSH_PROMISE frame.
>
>
> This  mechanism addresses 4 by allowing digests to extend over
> multiple domains, and addresses 5 by allowing the server to control
> which assets are part of the digest: resources *without* the
> "cache-digest-scope" header are never made part of the digest. Also,
> the holder of a wildcard certificate can still use it to host separate
> multi-domain applications, for example (app1.example.com,
> static1.example.com with cache digest scope "1") and
> (app2.example.com, static2.example.com with cache digest scope "2"),
> without fearing the cache digest to grow too big. Furthermore, if a
> server doesn't implement PUSH or otherwise doesn't use the cache
> digest, it implicitly opts out of cache digests,  saving bandwidth.
>
> The cache-digest-scope: xxxx header would be idem in most requests and
> responses, and HPACK in HTTP/2 could compress it to a few bytes by
> using the dynamic table.
>
> Best regards,
>
> ----
> Alcides.
>
>
> [1] http://blog.kazuhooku.com/2015/12/optimizing-performance-of-multi-tiered.html
> [2] http://nbviewer.ipython.org/github/shimmercat/art_timings/blob/master/TimingsOfResourceLoads.ipynb
>



-- 
Kazuho Oku
Received on Tuesday, 12 January 2016 01:04:29 UTC