Re: Submitted new I-D: Cache Digests for HTTP/2 from Kazuho Oku on 2016-01-12 (ietf-http-wg@w3.org from January to March 2016)

From: Kazuho Oku <kazuhooku@gmail.com>
Date: Wed, 13 Jan 2016 06:33:32 +0900
To: Stefan Eissing <stefan.eissing@greenbytes.de>
Cc: HTTP Working Group <ietf-http-wg@w3.org>
Message-ID: <CANatvzyoCec9Lu70EXJ47jssAK6kBk+dm+E6ozBgRLQ79yMFXQ@mail.gmail.com>
2016-01-13 1:31 GMT+09:00 Stefan Eissing <stefan.eissing@greenbytes.de>:
> Kazuho,
>
> while pondering my implementation: ch. 2.1 step 8 is different from the algorithm described in http://giovanni.bajo.it/post/47119962313/golomb-coded-sets-smaller-than-bloom-filters
>
> Your draft calculates (as I read it, I could be wrong)
> A.  for i in 0..len-2; do D:="hash-values[i+1] - hash-values[i] - 1"; ....; done
>
> while the latter does
> B.  for i in 0..len-1; do D:="hash-values[i] - (i > 0)? hash-values[i-1] : 0)"; ....; done
>
> I am not sure, on decompression, how to obtain the first hash value back from A. In B it just is the first delta. Could you give me a hint?

The algorithm intended here is as follows.  Please let me check the spec.

uint64_t next_min = 0;
for (i = 0; i < hash_values.length; ++i) {
  D = hash_values[i] - next_min;
  ...
  next_min = hash_values[i] + 1;
}

The corresponding code in my implementation is
https://github.com/kazuho/golombset/blob/e7cae04/golombset.h#L154-L158

> Thanks!
>
> -Stefan
>
>> Am 12.01.2016 um 10:01 schrieb Alcides Viamontes E <alcidesv@zunzun.se>:
>>
>> Hello,
>>
>> Thanks for your response Kazuho. I need to do some thinking before
>> fully addressing your comments, since you add valuable information
>> that I didn't consider before. At the risk of adding some noise
>> (please forgive me for that), I will write a few quick remarks:
>>
>>> Your calculation is wrong.  A 200-entry GCS (with 1/512 false positive
>>> rate) will be slightly larger than 225 bytes (log2(512) * 200 bits) in
>>> binary form.
>>
>> That's good news! My Google's cookie is 1246 bytes long. If we are not
>> talking about several kilobytes, then the restrictions are less.
>>
>> I would like to know more about the expectations for intermediaries.
>> As of today, HTTP/2 is working with TLS, therefore the website
>> operator needs to bless the HTTP/2 edge server with the SSL
>> certificate's private key. Maybe we can hope that website operators
>> will choose an HTTP/2 stack that does what he/she intends? In other
>> words, I think we shouldn't make the spec more difficult to use just
>> to accommodate potential issues with intermediaries. In that light, we
>> could perhaps require intermediaries to use a cache digest mindfully.
>> In any case please forgive me for my lack of data, this is just food
>> for thoughts and I will be glad to know more about how things look
>> right now with HTTP/2 intermediaries and caches.
>>
>> I will write a more detailed follow-up a few weeks from now. I will
>> also try to make a polyfill implementation using service workers and
>> ShimmerCat to learn how this looks in practice.
>>
>> Bests,
>>
>> ./Alcides.
>>
>>
>> On Tue, Jan 12, 2016 at 2:04 AM, Kazuho Oku <kazuhooku@gmail.com> wrote:
>>> 2016-01-11 2:11 GMT+09:00 Alcides Viamontes E <alcidesv@zunzun.se>:
>>>> Hello,
>>>>
>>>> My interest in the draft "Cache Digests for HTTP/2"
>>>> https://datatracker.ietf.org/doc/draft-kazuho-h2-cache-digest/
>>>> concerns the original, intended use case that Mr. Kakuho Oku and Mr.
>>>> M. Nottingham cited. As the authors, I would like very much like to
>>>> see this made a standard and implemented in browsers. However, I
>>>> perceive a few issues. Beforehand, I apologize for this long email,
>>>> for any gaps in my understanding of the subject, and for not being
>>>> familiar with the language and procedures used in this list.
>>>
>>> Thank you for your feedback!
>>>
>>>> Here are the issues that I see:
>>>>
>>>> 1.- In its current wording, no information about which version of a
>>>> representation the browser already has is present in the cache digest.
>>>> That information can be included in the URL itself (cache busting),
>>>> but then it becomes a concern for web-developers, adds complexity to
>>>> their work, and bypasses the mechanisms that HTTP has in place for
>>>> maintaining cache state.  It also increases space pressure in the the
>>>> browser's cache as the server is left with no means to expire old
>>>> cached contents in the browser.
>>>
>>> That is a very good point.
>>>
>>> Let me first discuss the restrictions of the cache model used by HTTP,
>>> and then go on to discuss what we should do if we are to fix the point
>>> you raised.
>>>
>>> First about the restriction; the resources in the cache can be divided
>>> into two groups: fresh and non-fresh.  A server should never push a
>>> resource that is considered as fresh in the client's cache.  Clients
>>> will not notice the push / the HTTP/2 allows client to discard such
>>> push.  Therefore, a CACHE_DIGEST frame
>>> must include a filter that marks the resources that are marked as
>>> being fresh.  That is what the current draft specifies.
>>>
>>> Next about the point of including version information (e.g.
>>> Last-Modified, ETag) in the cache digest.  I believe we can add a
>>> second Golomb-coded set to the frame that uses hash(URI + version
>>> information) as the key.  A server can refer to the information to
>>> determine whether if it should push a 304 response or a 200 response.
>>>
>>> The downside is that the CACHE_DIGEST frame may become larger (if the
>>> server sends many responses that would become non-fresh), so it might
>>> be sensible to allow the client to decide if it should send the second
>>> Golomb-coded set.
>>>
>>> In addition, we should agree on how to push 304 response.  My
>>> understanding is that HTTP/2 spec., is vague on this, and that there
>>> has not yet been an agreement  between the client developers on how it
>>> should be done.
>>>
>>> Once that is solved, I think we should update the I-D to cover the
>>> version information as well.
>>>
>>>> 2.- There is no way for the server to know that a CACHE_DIGEST frame
>>>> is coming immediately after a HEADERS frame. A server may trigger some
>>>> processing already after the end of headers has been received, while
>>>> making  further DATA frames available as a stream of data to the
>>>> application. With CACHE_DIGEST frames, the cache aware server will
>>>> have to delay processing until the end of the stream has been seen to
>>>> be sure that no CACHE_DIGEST frame is coming, or would have to
>>>> re-start processing on seeing the frame. Arguably this is not a big
>>>> problem for GET requests with an empty body, but it would be nice if
>>>> the spec didn't force the server to wait for the end of the stream.
>>>
>>> Agreed.
>>>
>>> There are three options here (the draft adopts option C):
>>>
>>> a) send CACHE_DIGEST frame right before HEADERS
>>> b) send CACHE_DIGEST frame at stream_id=zero, with the value of the
>>> authority that should be associated to the digest included within the
>>> frame
>>> c) send CACHE_DIGEST frame right after HEADERS
>>>
>>> B is clearly the easiest but would have a small impact on the consumed
>>> bandwidth, since the authority needs to be sent separately.
>>>
>>> In A, the server does not need to delay the processing of the request,
>>> but needs to cache the value of the digest.
>>>
>>> It would be great to discuss which of the three approach will be the
>>> best solution in general (or if there could be other approaches).
>>>
>>>> 3. - Traditionally, cache state information has been placed in HTTP
>>>> header fields. A CACHE_DIGEST frame puts some of that information in a
>>>> new place, which is sure to cause some pain to web developers and
>>>> sys-admins trying to understand the behavior of their applications.
>>>
>>> CACHE_DIGEST frame should not be a HTTP header, since including the
>>> value in every HTTP request (as a header) will make the HTTP requests
>>> huge.  Since the client's cache state changes as the server sends
>>> responses, we cannot expect HPACK to effectively compress the
>>> requests.  We should send cache digest only once per HTTP/2
>>> connection.
>>> (note that intermediaries are allowed to re-order the HTTP requests
>>> sent from a client, so it is impossible to include the digest only in
>>> the first HTTP request as a header).
>>>
>>> The other reason is that the digest should be hop-by-hop.  The default
>>> behavior of a proxy (that do not understand the extension) should be
>>> to drop the digest.
>>>
>>>> 4.- The draft assumes a somewhat more restricted scope of Push than
>>>> allowed by the HTTP/2 spec, RFC7540, and to some extent, goes against
>>>> current practice. Section 8.2 of RFC7540, "Server Push", says "The
>>>> server MUST include a value in the :authority" pseudo-header field for
>>>> which the server is authoritative". Section 10.1 defines server
>>>> authority by referring to [RFC7230], Section 9.1. For the HTTPS case,
>>>> a server is authoritative for a domain if it can present a certificate
>>>> that covers that domain. To the point, RFC7540 does not forbids a
>>>> server to push resources for different domains, provided that it has
>>>> the right credentials. Pushing assets for a domain different than the
>>>> one where the request is received is useful when considering the way
>>>> web applications are structured today: many serve their application
>>>> logic using a www.example.com domain, while serving their static
>>>> assets at static.example.com . Therefore, upon receiving a request to
>>>> www.example.com, a server may want to push resources for
>>>> static.example.com. However, section 2.1 of the draft works against
>>>> that use case.
>>>
>>> Thank you for pointing that out.
>>>
>>> I think that for plaintext HTTP we agree that the client needs to
>>> associate the name of the authority to the digest that it sends
>>> (including one of the three options discussed above).
>>>
>>> Considering the case for HTTPS, may be we should better allow the
>>> client whether or not to associate an authority.
>>>
>>>> 5.- A last issue has to do with what to include in the cache digest.
>>>> Mr.  Oku proposes to only push resources which are in the critical
>>>> render path in his article at [1]. Correspondingly, the cache digest
>>>> would only need to include those resources. Can we have a simple
>>>> mechanism to control the cache digest contents?
>>>
>>> It is obvious that providing a way to specify the resources that
>>> should be included in the cache digest will let clients generate more
>>> compact digest values.
>>>
>>> The downside is that it would be difficult for server administrators
>>> to _change_ a resource to become part of the digest.  Consider the
>>> case where a server has send resource A that is not being marked as
>>> part of the digest, and then the server administrator then changes the
>>> configuration for resource A to be included in part of the digest.
>>> The client will not include A in the digest it sends, since it is not
>>> marked.  The server will push the A to the client since it is not
>>> included in the digest.  (As discussed above) a client may discard the
>>> resource being pushed.  So A will continued to be pushed every time a
>>> new request is issued.
>>>
>>> Considering such possibility, it would be less troublesome if we could
>>> go without introducing a way to configure what should be included in
>>> the digest.
>>>
>>>
>>>> I can provide some data and some rough suggestions to address the issues above.
>>>>
>>>> How big would a cache digest be anyway?
>>>> -------------------------------------------------------
>>>>
>>>> To address issues 2 and 3 we need to determine how constrained we are
>>>> regarding space. We have made a little study[2] across 1300 sites
>>>> submitted by performance-conscious site operators, and from there we
>>>> can establish that while 50% of sites fetch between 25 and 110
>>>> resources, it is not too rare to have sites doing more than 200 HTTP
>>>> requests. If anything, that number is going to grow. Specially with
>>>> HTTP/2. Let's then use 200 as a ballpark estimate of the number of
>>>> items in a cache digest and start from there.
>>>>
>>>> The source that the draft includes for Golomb-coded-sets (GCS) hints
>>>> that it is possible to use the number of bits in a Bloom filter as an
>>>> upper bound for the size of the corresponding GCS. Therefore, with a
>>>> digest of size 200, we would be using an upper bound of 200*1.44*512
>>>> bits, which is around 18 kB is expressed as binary, and around 24 kB
>>>> if expressed in ascii form, base64-encoded, assuming a false positive
>>>> probability of 1/512.
>>>
>>> Your calculation is wrong.  A 200-entry GCS (with 1/512 false positive
>>> rate) will be slightly larger than 225 bytes (log2(512) * 200 bits) in
>>> binary form.
>>>
>>>> Notice that by using PUSH the browser may skip many of those requests.
>>>> In our site (https://www.shimmercat.com), we have measured HTTP/2
>>>> requests averaging at 60 bytes per request. Therefore, one may end up
>>>> saving up to 200 * 60 = 12  kB in traffic, bringing down the previous
>>>> numbers to 18 kB -12 kB =6 kB and 24 kB - 12 kB = 12 kB. I think that
>>>> 12 kB is acceptable for a site with 200 requests, specially since
>>>> HTTP/2 PUSH would greatly increase the data transfer density for those
>>>> sites.
>>>>
>>>>
>>>> Can we embed the cache digest in a header?
>>>> ------------------------------------------------------------
>>>>
>>>> Having 24 kB of cache digest in a header may delay processing the
>>>> request more than acceptable, since most servers will wait to get the
>>>> entire header block before starting to create an answer. There is an
>>>> alternative however, and that would be to put a  field with the cache
>>>> digest in a request trailer, allowed with chunked transfer under
>>>> HTTP/1.1 and in all streams with HTTP/2. The pros of having the cache
>>>> digest in a header or trailer field are the following:  we don't break
>>>> with the tradition of exchanging cache state through headers,  headers
>>>> are visible to developers' tools, it would be possible to test things
>>>> using polyfills and service workers while the browsers catch up with
>>>> native implementations, no extensions to HTTP/2 are needed, and cache
>>>> digests would become possible even over plain old HTTP/1.1. It can
>>>> also be made a little more future-proof:
>>>>
>>>> In the headers:
>>>>
>>>>       cache-digest: trailers
>>>>
>>>> (the indication above is not needed however if the cache-digest-scope
>>>> is used, see below)
>>>>
>>>> In the trailers:
>>>>
>>>>         cache-digest: data:application/golomb-coded-set;base64,.....
>>>>
>>>> The cons is that ascii is bigger than binary.
>>>>
>>>> Even if the CACHE_DIGEST frame is pursued, it would be nice to have
>>>>
>>>>         cache-digest: frame
>>>>
>>>> as part of the request (and this time in the headers section, not the
>>>> trailers) for the server to recognize that a cache digest frame is
>>>> coming and for developers to have a hint that said information is
>>>> being transmitted between client and server.
>>>>
>>>>
>>>> Distinguishing representation versions in the cache digest (Addressing point 1)
>>>>
>>>> ---------------------------------------------------------------------------------------------------------
>>>>
>>>> The GCS filter requires the client and the server to be able to
>>>> compute the same hash key for a given resource and version. As far as
>>>> I understand, having semantics here similar to if-modified-since would
>>>> not be possible. But strong etags could be used when computing the
>>>> key, therefore enabling the equivalent to if-none-match. Step 4 in the
>>>> algorithm of section 2.1 of the draft could be extended to have the
>>>> etag used together with the URL when taking the hash.
>>>>
>>>>
>>>> Which representations should be part of the digest? (Addressing point 4 and 5)
>>>> --------------------------------------------------------------------------------------------------------
>>>>
>>>> I suggest to introduce the concept of cache digest scope. Only
>>>> representations which were given a cache digest scope would be made
>>>> part of a cache digest. And the set of representations URLs to be
>>>> included by the client in the digest would be the intersection of:
>>>>
>>>> 1. The set of representations that have the same cache digest scope in
>>>> the browser's cache than the domain of the first request (the
>>>> document), and
>>>> 2. The set of representations in the browser's cache for which the
>>>> server is considered authoritative.
>>>>
>>>> The cache digest scope would be unique per domain.
>>>>
>>>> In other words, it would look like the following:
>>>>
>>>> Client asks for
>>>>                  https://www.example.com/
>>>> Server answers, and adds a header
>>>>  cache-digest-scope: example
>>>> The server then answers or pushes
>>>>   https://static.example.com/styles.css ,
>>>>        it uses the same header
>>>>           cache-digest-scope: example.
>>>>
>>>> The server also answers or pushes
>>>>   https://media.example.com/hero-1.png,
>>>>       but no cache-digest-scope is provided.
>>>>
>>>>
>>>> .... some time after, when a new connection is established by the same
>>>> client to fetch another page from the same domain:
>>>>
>>>> Client asks for (a different page)
>>>>         https://www.example.com/page1.html ,
>>>>        now the client specifies a header
>>>>       cache-digest-scope: example
>>>>        client also provides a cache digest with all
>>>>        the resources that were assigned
>>>>        the same cache digest scope by the server.
>>>>        That digest would include the resource from
>>>> https://static.example.com/styles.css
>>>>        but not the one at
>>>>                 https://media.example.com/hero-1.png
>>>>
>>>> The server answers and pushes a 304 not modified for
>>>> https://static.example.com/styles.css ,
>>>>        or a 200 with new contents, using a cache contents
>>>>        aware PUSH_PROMISE frame.
>>>>
>>>>
>>>> This  mechanism addresses 4 by allowing digests to extend over
>>>> multiple domains, and addresses 5 by allowing the server to control
>>>> which assets are part of the digest: resources *without* the
>>>> "cache-digest-scope" header are never made part of the digest. Also,
>>>> the holder of a wildcard certificate can still use it to host separate
>>>> multi-domain applications, for example (app1.example.com,
>>>> static1.example.com with cache digest scope "1") and
>>>> (app2.example.com, static2.example.com with cache digest scope "2"),
>>>> without fearing the cache digest to grow too big. Furthermore, if a
>>>> server doesn't implement PUSH or otherwise doesn't use the cache
>>>> digest, it implicitly opts out of cache digests,  saving bandwidth.
>>>>
>>>> The cache-digest-scope: xxxx header would be idem in most requests and
>>>> responses, and HPACK in HTTP/2 could compress it to a few bytes by
>>>> using the dynamic table.
>>>>
>>>> Best regards,
>>>>
>>>> ----
>>>> Alcides.
>>>>
>>>>
>>>> [1] http://blog.kazuhooku.com/2015/12/optimizing-performance-of-multi-tiered.html
>>>> [2] http://nbviewer.ipython.org/github/shimmercat/art_timings/blob/master/TimingsOfResourceLoads.ipynb
>>>>
>>>
>>>
>>>
>>> --
>>> Kazuho Oku
>>
>



-- 
Kazuho Oku
Received on Tuesday, 12 January 2016 21:34:04 UTC