- From: Kazuho Oku <kazuhooku@gmail.com>
- Date: Wed, 13 Jan 2016 06:33:32 +0900
- To: Stefan Eissing <stefan.eissing@greenbytes.de>
- Cc: HTTP Working Group <ietf-http-wg@w3.org>
2016-01-13 1:31 GMT+09:00 Stefan Eissing <stefan.eissing@greenbytes.de>:
> Kazuho,
>
> while pondering my implementation: ch. 2.1 step 8 is different from
> the algorithm described in
> http://giovanni.bajo.it/post/47119962313/golomb-coded-sets-smaller-than-bloom-filters
>
> Your draft calculates (as I read it, I could be wrong)
>
> A. for i in 0..len-2; do D:="hash-values[i+1] - hash-values[i] - 1"; ....; done
>
> while the latter does
>
> B. for i in 0..len-1; do D:="hash-values[i] - ((i > 0) ? hash-values[i-1] : 0)"; ....; done
>
> I am not sure, on decompression, how to obtain the first hash value
> back from A. In B it just is the first delta. Could you give me a
> hint?

The algorithm intended here is as follows. Please let me check the spec.

    uint64_t next_min = 0;
    for (i = 0; i < hash_values.length; ++i) {
        D = hash_values[i] - next_min;
        ...
        next_min = hash_values[i] + 1;
    }

The corresponding code in my implementation is
https://github.com/kazuho/golombset/blob/e7cae04/golombset.h#L154-L158
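Spelled out as a pair of self-contained C routines (the function names
are illustrative, and the Rice-coding of each delta, elided as "..."
above, is likewise omitted here), the scheme and its inverse look like
this; the decoder starting with the same next_min == 0 is what recovers
the first hash value, so scheme A needs no special case:

    #include <stddef.h>
    #include <stdint.h>

    /* Encode: each delta is the distance from the smallest value that
     * can legally come next, hence the "+ 1". */
    static void deltas_from_hashes(const uint64_t *hashes, uint64_t *deltas,
                                   size_t len)
    {
        uint64_t next_min = 0;
        for (size_t i = 0; i < len; ++i) {
            deltas[i] = hashes[i] - next_min;
            next_min = hashes[i] + 1;
        }
    }

    /* Decode: the first iteration runs with next_min == 0, so the very
     * first delta alone reconstructs the first hash value. */
    static void hashes_from_deltas(const uint64_t *deltas, uint64_t *hashes,
                                   size_t len)
    {
        uint64_t next_min = 0;
        for (size_t i = 0; i < len; ++i) {
            hashes[i] = next_min + deltas[i];
            next_min = hashes[i] + 1;
        }
    }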
> Thanks!
>
> -Stefan
>
>> Am 12.01.2016 um 10:01 schrieb Alcides Viamontes E <alcidesv@zunzun.se>:
>>
>> Hello,
>>
>> Thanks for your response, Kazuho. I need to do some thinking before
>> fully addressing your comments, since you add valuable information
>> that I didn't consider before. At the risk of adding some noise
>> (please forgive me for that), I will write a few quick remarks:
>>
>>> Your calculation is wrong. A 200-entry GCS (with a 1/512 false
>>> positive rate) will be slightly larger than 225 bytes
>>> (log2(512) * 200 bits) in binary form.
>>
>> That's good news! My Google cookie is 1246 bytes long. If we are not
>> talking about several kilobytes, then the restrictions are less
>> severe.
>>
>> I would like to know more about the expectations for intermediaries.
>> As of today, HTTP/2 is deployed with TLS, therefore the website
>> operator needs to bless the HTTP/2 edge server with the SSL
>> certificate's private key. Maybe we can hope that website operators
>> will choose an HTTP/2 stack that does what they intend? In other
>> words, I think we shouldn't make the spec more difficult to use just
>> to accommodate potential issues with intermediaries. In that light,
>> we could perhaps require intermediaries to use a cache digest
>> mindfully. In any case, please forgive me for my lack of data; this
>> is just food for thought, and I will be glad to know more about how
>> things look right now with HTTP/2 intermediaries and caches.
>>
>> I will write a more detailed follow-up a few weeks from now. I will
>> also try to make a polyfill implementation using service workers and
>> ShimmerCat to learn how this looks in practice.
>>
>> Bests,
>>
>> ./Alcides.
>>
>>
>> On Tue, Jan 12, 2016 at 2:04 AM, Kazuho Oku <kazuhooku@gmail.com> wrote:
>>> 2016-01-11 2:11 GMT+09:00 Alcides Viamontes E <alcidesv@zunzun.se>:
>>>> Hello,
>>>>
>>>> My interest in the draft "Cache Digests for HTTP/2"
>>>> https://datatracker.ietf.org/doc/draft-kazuho-h2-cache-digest/
>>>> concerns the original, intended use case that Mr. Kazuho Oku and
>>>> Mr. M. Nottingham cited. Like the authors, I would very much like
>>>> to see this made a standard and implemented in browsers. However,
>>>> I perceive a few issues. Beforehand, I apologize for this long
>>>> email, for any gaps in my understanding of the subject, and for
>>>> not being familiar with the language and procedures used in this
>>>> list.
>>>
>>> Thank you for your feedback!
>>>
>>>> Here are the issues that I see:
>>>>
>>>> 1.- In its current wording, no information about which version of
>>>> a representation the browser already has is present in the cache
>>>> digest. That information can be included in the URL itself (cache
>>>> busting), but then it becomes a concern for web developers, adds
>>>> complexity to their work, and bypasses the mechanisms that HTTP
>>>> has in place for maintaining cache state. It also increases space
>>>> pressure in the browser's cache, as the server is left with no
>>>> means to expire old cached contents in the browser.
>>>
>>> That is a very good point.
>>>
>>> Let me first discuss the restrictions of the cache model used by
>>> HTTP, and then go on to discuss what we should do if we are to fix
>>> the point you raised.
>>>
>>> First, about the restriction: the resources in the cache can be
>>> divided into two groups, fresh and non-fresh. A server should never
>>> push a resource that is considered fresh in the client's cache;
>>> clients will not make use of such a push, and HTTP/2 allows a
>>> client to discard it. Therefore, a CACHE_DIGEST frame must include
>>> a filter that covers the resources that are fresh in the client's
>>> cache. That is what the current draft specifies.
>>>
>>> Next, about including version information (e.g. Last-Modified,
>>> ETag) in the cache digest. I believe we can add a second
>>> Golomb-coded set to the frame that uses hash(URI + version
>>> information) as the key. A server can refer to that information to
>>> determine whether it should push a 304 response or a 200 response.
>>>
>>> The downside is that the CACHE_DIGEST frame may become larger (if
>>> the server sends many responses that would become non-fresh), so it
>>> might be sensible to allow the client to decide whether it sends
>>> the second Golomb-coded set.
>>>
>>> In addition, we should agree on how to push a 304 response. My
>>> understanding is that the HTTP/2 spec is vague on this, and that
>>> there has not yet been an agreement among client developers on how
>>> it should be done.
>>>
>>> Once that is solved, I think we should update the I-D to cover the
>>> version information as well.
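A sketch of how a server might combine the two sets under this
proposal (gcs_contains, hash_uri, and hash_uri_version are hypothetical
stand-ins; the draft defines neither the second set nor any such API):

    #include <stdbool.h>
    #include <stdint.h>

    struct gcs;

    /* Hypothetical GCS membership test and key hashes; stand-ins, not
     * the golombset API or the draft's hashing. */
    bool gcs_contains(const struct gcs *set, uint64_t key);
    uint64_t hash_uri(const char *uri);
    uint64_t hash_uri_version(const char *uri, const char *etag);

    enum push_action { PUSH_NOTHING, PUSH_304, PUSH_200 };

    /* Decide what to push for `uri`, whose current version is `etag`.
     * `fresh` holds hash(URI) for fresh cached resources; `versions` is
     * the proposed optional second set, keyed on hash(URI + version). */
    enum push_action decide_push(const struct gcs *fresh,
                                 const struct gcs *versions,
                                 const char *uri, const char *etag)
    {
        if (gcs_contains(fresh, hash_uri(uri)))
            return PUSH_NOTHING; /* client already holds a fresh copy */
        if (versions != NULL &&
            gcs_contains(versions, hash_uri_version(uri, etag)))
            return PUSH_304;     /* client's non-fresh copy matches */
        return PUSH_200;         /* client lacks it, or has an old version */
    }

Since GCS lookups can return false positives, the first two branches
can occasionally be taken wrongly; the cost is a skipped push or a
useless 304, never a wrong response body.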
>>>> 2.- There is no way for the server to know that a CACHE_DIGEST
>>>> frame is coming immediately after a HEADERS frame. A server may
>>>> trigger some processing already after the end of the headers has
>>>> been received, while making further DATA frames available as a
>>>> stream of data to the application. With CACHE_DIGEST frames, the
>>>> cache-aware server will have to delay processing until the end of
>>>> the stream has been seen to be sure that no CACHE_DIGEST frame is
>>>> coming, or would have to re-start processing on seeing the frame.
>>>> Arguably this is not a big problem for GET requests with an empty
>>>> body, but it would be nice if the spec didn't force the server to
>>>> wait for the end of the stream.
>>>
>>> Agreed.
>>>
>>> There are three options here (the draft adopts option C):
>>>
>>> a) send the CACHE_DIGEST frame right before HEADERS
>>> b) send the CACHE_DIGEST frame at stream_id=zero, with the value of
>>>    the authority that should be associated with the digest included
>>>    within the frame
>>> c) send the CACHE_DIGEST frame right after HEADERS
>>>
>>> B is clearly the easiest, but would have a small impact on the
>>> consumed bandwidth, since the authority needs to be sent
>>> separately.
>>>
>>> In A, the server does not need to delay the processing of the
>>> request, but needs to cache the value of the digest.
>>>
>>> It would be great to discuss which of the three approaches would be
>>> the best solution in general (or whether there could be other
>>> approaches).
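To illustrate the cost under option C, here is a minimal sketch of the
server-side bookkeeping being discussed (the frame handlers and names
are hypothetical, not part of the draft; the sketch only shows where
the waiting happens):

    #include <stdbool.h>

    struct gcs;

    struct stream {
        bool digest_resolved; /* do we know yet whether a digest was sent? */
        struct gcs *digest;   /* NULL if the client sent none */
    };

    void start_request_processing(struct stream *s);

    /* Under option C, HEADERS alone is not enough to start processing:
     * a CACHE_DIGEST frame may still follow on the same stream. */
    void on_headers(struct stream *s) { (void)s; /* wait */ }

    void on_cache_digest(struct stream *s, struct gcs *digest)
    {
        s->digest = digest;
        s->digest_resolved = true;
        start_request_processing(s);
    }

    /* Only a subsequent frame (e.g. DATA) or the end of the stream
     * tells the server that no CACHE_DIGEST is coming, which is the
     * delay pointed out above. */
    void on_other_frame_or_end_of_stream(struct stream *s)
    {
        if (!s->digest_resolved) {
            s->digest_resolved = true;
            start_request_processing(s);
        }
    }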
>>>> 3.- Traditionally, cache state information has been placed in
>>>> HTTP header fields. A CACHE_DIGEST frame puts some of that
>>>> information in a new place, which is sure to cause some pain to
>>>> web developers and sysadmins trying to understand the behavior of
>>>> their applications.
>>>
>>> The cache digest should not be an HTTP header field, since
>>> including the value in every HTTP request (as a header) would make
>>> the HTTP requests huge. Since the client's cache state changes as
>>> the server sends responses, we cannot expect HPACK to effectively
>>> compress the requests. We should send the cache digest only once
>>> per HTTP/2 connection. (Note that intermediaries are allowed to
>>> re-order the HTTP requests sent from a client, so it is impossible
>>> to include the digest only in the first HTTP request as a header.)
>>>
>>> The other reason is that the digest should be hop-by-hop. The
>>> default behavior of a proxy that does not understand the extension
>>> should be to drop the digest.
>>>
>>>> 4.- The draft assumes a somewhat more restricted scope of Push
>>>> than allowed by the HTTP/2 spec, RFC 7540, and to some extent goes
>>>> against current practice. Section 8.2 of RFC 7540, "Server Push",
>>>> says "The server MUST include a value in the ":authority"
>>>> pseudo-header field for which the server is authoritative".
>>>> Section 10.1 defines server authority by referring to [RFC7230],
>>>> Section 9.1. For the HTTPS case, a server is authoritative for a
>>>> domain if it can present a certificate that covers that domain. To
>>>> the point, RFC 7540 does not forbid a server to push resources for
>>>> different domains, provided that it has the right credentials.
>>>> Pushing assets for a domain different from the one where the
>>>> request is received is useful when considering the way web
>>>> applications are structured today: many serve their application
>>>> logic from a www.example.com domain, while serving their static
>>>> assets at static.example.com. Therefore, upon receiving a request
>>>> to www.example.com, a server may want to push resources for
>>>> static.example.com. However, section 2.1 of the draft works
>>>> against that use case.
>>>
>>> Thank you for pointing that out.
>>>
>>> I think that for plaintext HTTP we agree that the client needs to
>>> associate the name of the authority with the digest that it sends
>>> (using one of the three options discussed above).
>>>
>>> Considering the case for HTTPS, maybe we should let the client
>>> choose whether or not to associate an authority.
>>>
>>>> 5.- A last issue has to do with what to include in the cache
>>>> digest. Mr. Oku proposes to only push resources which are in the
>>>> critical render path in his article at [1]. Correspondingly, the
>>>> cache digest would only need to include those resources. Can we
>>>> have a simple mechanism to control the cache digest contents?
>>>
>>> It is obvious that providing a way to specify the resources that
>>> should be included in the cache digest would let clients generate
>>> more compact digest values.
>>>
>>> The downside is that it would be difficult for server
>>> administrators to _change_ a resource to become part of the digest.
>>> Consider the case where a server has been sending resource A
>>> unmarked, and the administrator then changes the configuration so
>>> that resource A is included in the digest. The client will not
>>> include A in the digest it sends, since it is not marked. The
>>> server will push A to the client, since it is not included in the
>>> digest. (As discussed above) a client may discard the resource
>>> being pushed. So A will continue to be pushed every time a new
>>> request is issued.
>>>
>>> Considering such a possibility, it would be less troublesome if we
>>> could go without introducing a way to configure what should be
>>> included in the digest.
>>>
>>>
>>>> I can provide some data and some rough suggestions to address the
>>>> issues above.
>>>>
>>>> How big would a cache digest be anyway?
>>>> -------------------------------------------------------
>>>>
>>>> To address issues 2 and 3 we need to determine how constrained we
>>>> are regarding space. We have made a little study [2] across 1300
>>>> sites submitted by performance-conscious site operators, and from
>>>> there we can establish that while 50% of sites fetch between 25
>>>> and 110 resources, it is not too rare to have sites doing more
>>>> than 200 HTTP requests. If anything, that number is going to grow,
>>>> especially with HTTP/2. Let's then use 200 as a ballpark estimate
>>>> of the number of items in a cache digest and start from there.
>>>>
>>>> The source that the draft includes for Golomb-coded sets (GCS)
>>>> hints that it is possible to use the number of bits in a Bloom
>>>> filter as an upper bound for the size of the corresponding GCS.
>>>> Therefore, with a digest of size 200, we would be using an upper
>>>> bound of 200*1.44*512 bits, which is around 18 kB if expressed as
>>>> binary, and around 24 kB if expressed in ASCII form,
>>>> base64-encoded, assuming a false positive probability of 1/512.
>>>
>>> Your calculation is wrong. A 200-entry GCS (with a 1/512 false
>>> positive rate) will be slightly larger than 225 bytes
>>> (log2(512) * 200 bits) in binary form.
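Working the two estimates through side by side makes the correction
clear (the 1.44 constant is the Bloom-filter space factor, and the slip
in the first estimate is using P = 512 itself where its logarithm,
log2(512) = 9, belongs):

    Bloom filter: N * 1.44 * log2(P) = 200 * 1.44 * 9   =   2592 bits, ~324 bytes
    GCS:          N * log2(P)        = 200 * 9          =   1800 bits,  225 bytes
    (mistaken)    N * 1.44 * P       = 200 * 1.44 * 512 = 147456 bits,  ~18 kB

The "slightly larger" in the correction covers the per-entry overhead
of the unary prefix of each Rice code, roughly one to two bits per
entry on average.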
>>>> Notice that by using PUSH the browser may skip many of those
>>>> requests. In our site (https://www.shimmercat.com), we have
>>>> measured HTTP/2 requests averaging 60 bytes per request.
>>>> Therefore, one may end up saving up to 200 * 60 = 12 kB in
>>>> traffic, bringing down the previous numbers to 18 kB - 12 kB =
>>>> 6 kB and 24 kB - 12 kB = 12 kB. I think that 12 kB is acceptable
>>>> for a site with 200 requests, especially since HTTP/2 PUSH would
>>>> greatly increase the data transfer density for those sites.
>>>>
>>>>
>>>> Can we embed the cache digest in a header?
>>>> ------------------------------------------------------------
>>>>
>>>> Having 24 kB of cache digest in a header may delay processing the
>>>> request more than is acceptable, since most servers will wait to
>>>> get the entire header block before starting to create an answer.
>>>> There is an alternative, however, and that would be to put a field
>>>> with the cache digest in a request trailer, allowed with chunked
>>>> transfer under HTTP/1.1 and in all streams with HTTP/2. The pros
>>>> of having the cache digest in a header or trailer field are the
>>>> following: we don't break with the tradition of exchanging cache
>>>> state through headers, headers are visible in developers' tools,
>>>> it would be possible to test things using polyfills and service
>>>> workers while the browsers catch up with native implementations,
>>>> no extensions to HTTP/2 are needed, and cache digests would become
>>>> possible even over plain old HTTP/1.1. It can also be made a
>>>> little more future-proof:
>>>>
>>>> In the headers:
>>>>
>>>> cache-digest: trailers
>>>>
>>>> (the indication above is not needed, however, if the
>>>> cache-digest-scope is used, see below)
>>>>
>>>> In the trailers:
>>>>
>>>> cache-digest: data:application/golomb-coded-set;base64,.....
>>>>
>>>> The con is that ASCII is bigger than binary.
>>>>
>>>> Even if the CACHE_DIGEST frame is pursued, it would be nice to
>>>> have
>>>>
>>>> cache-digest: frame
>>>>
>>>> as part of the request (and this time in the headers section, not
>>>> the trailers) for the server to recognize that a cache digest
>>>> frame is coming and for developers to have a hint that said
>>>> information is being transmitted between client and server.
>>>>
>>>>
>>>> Distinguishing representation versions in the cache digest
>>>> (Addressing point 1)
>>>> ---------------------------------------------------------------------------------------------------------
>>>>
>>>> The GCS filter requires the client and the server to be able to
>>>> compute the same hash key for a given resource and version. As far
>>>> as I understand, having semantics here similar to
>>>> if-modified-since would not be possible. But strong ETags could be
>>>> used when computing the key, therefore enabling the equivalent of
>>>> if-none-match. Step 4 in the algorithm of section 2.1 of the draft
>>>> could be extended to have the ETag used together with the URL when
>>>> taking the hash.
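As a sketch of that extension (hash64 and the newline separator are
illustrative assumptions, not the draft's actual step-4 hashing),
moving from hash(URL) to hash(URL + ETag) is a small change on both
sides:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Stand-in for the draft's hashing step; illustrative only. */
    uint64_t hash64(const void *data, size_t len);

    /* Key on the URL alone (roughly what section 2.1 does today). */
    uint64_t digest_key(const char *url)
    {
        return hash64(url, strlen(url));
    }

    /* Key on URL + strong ETag, enabling an if-none-match equivalent.
     * A separator avoids ambiguous concatenations. */
    uint64_t digest_key_versioned(const char *url, const char *etag)
    {
        char buf[1024];
        int n = snprintf(buf, sizeof(buf), "%s\n%s", url, etag);
        if (n < 0 || (size_t)n >= sizeof(buf))
            return 0; /* truncation handling elided in this sketch */
        return hash64(buf, (size_t)n);
    }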
>>>>
>>>> Which representations should be part of the digest? (Addressing
>>>> points 4 and 5)
>>>> --------------------------------------------------------------------------------------------------------
>>>>
>>>> I suggest introducing the concept of a cache digest scope. Only
>>>> representations which were given a cache digest scope would be
>>>> made part of a cache digest. And the set of representation URLs to
>>>> be included by the client in the digest would be the intersection
>>>> of:
>>>>
>>>> 1. The set of representations in the browser's cache that have the
>>>> same cache digest scope as the domain of the first request (the
>>>> document), and
>>>> 2. The set of representations in the browser's cache for which the
>>>> server is considered authoritative.
>>>>
>>>> The cache digest scope would be unique per domain.
>>>>
>>>> In other words, it would look like the following. Client asks for
>>>>
>>>> https://www.example.com/
>>>>
>>>> Server answers, and adds a header
>>>>
>>>> cache-digest-scope: example
>>>>
>>>> The server then answers or pushes
>>>> https://static.example.com/styles.css , and it uses the same
>>>> header cache-digest-scope: example. The server also answers or
>>>> pushes https://media.example.com/hero-1.png, but no
>>>> cache-digest-scope is provided.
>>>>
>>>> .... some time after, when a new connection is established by the
>>>> same client to fetch another page from the same domain: the client
>>>> asks for (a different page)
>>>>
>>>> https://www.example.com/page1.html
>>>>
>>>> and now the client specifies a header
>>>>
>>>> cache-digest-scope: example
>>>>
>>>> The client also provides a cache digest with all the resources
>>>> that were assigned the same cache digest scope by the server. That
>>>> digest would include the resource from
>>>> https://static.example.com/styles.css but not the one at
>>>> https://media.example.com/hero-1.png .
>>>>
>>>> The server answers, and pushes a 304 Not Modified for
>>>> https://static.example.com/styles.css , or a 200 with new
>>>> contents, using a cache-contents-aware PUSH_PROMISE frame.
>>>>
>>>> This mechanism addresses 4 by allowing digests to extend over
>>>> multiple domains, and addresses 5 by allowing the server to
>>>> control which assets are part of the digest: resources *without*
>>>> the cache-digest-scope header are never made part of the digest.
>>>> Also, the holder of a wildcard certificate can still use it to
>>>> host separate multi-domain applications, for example
>>>> (app1.example.com, static1.example.com with cache digest scope
>>>> "1") and (app2.example.com, static2.example.com with cache digest
>>>> scope "2"), without fear of the cache digest growing too big.
>>>> Furthermore, if a server doesn't implement PUSH or otherwise
>>>> doesn't use the cache digest, it implicitly opts out of cache
>>>> digests, saving bandwidth.
>>>>
>>>> The cache-digest-scope: xxxx header would be identical in most
>>>> requests and responses, and HPACK in HTTP/2 could compress it to a
>>>> few bytes by using the dynamic table.
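A sketch of the client-side selection rule this describes, i.e. the
intersection of sets 1 and 2 above (the cache-entry shape and the
authority check are illustrative assumptions; no real browser API is
implied):

    #include <stdbool.h>
    #include <string.h>

    /* Illustrative cache-entry shape. */
    struct cache_entry {
        const char *url;
        const char *scope; /* value of cache-digest-scope, or NULL */
    };

    /* Hypothetical: does the connection's certificate cover this
     * entry's host? */
    bool server_is_authoritative_for(const struct cache_entry *e);

    /* An entry enters the digest only if it carries the same scope the
     * document's domain was given AND the server is authoritative. */
    bool include_in_digest(const struct cache_entry *e,
                           const char *document_scope)
    {
        if (e->scope == NULL || document_scope == NULL)
            return false; /* unscoped entries are never included */
        if (strcmp(e->scope, document_scope) != 0)
            return false; /* set 1: same scope as the document */
        return server_is_authoritative_for(e); /* set 2 */
    }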
>>>>
>>>> Best regards,
>>>>
>>>> ----
>>>> Alcides.
>>>>
>>>>
>>>> [1] http://blog.kazuhooku.com/2015/12/optimizing-performance-of-multi-tiered.html
>>>> [2] http://nbviewer.ipython.org/github/shimmercat/art_timings/blob/master/TimingsOfResourceLoads.ipynb
>>>
>>>
>>> --
>>> Kazuho Oku
>>
>

--
Kazuho Oku

Received on Tuesday, 12 January 2016 21:34:04 UTC