
Re: Submitted new I-D: Cache Digests for HTTP/2

From: Kazuho Oku <kazuhooku@gmail.com>
Date: Sat, 9 Jan 2016 15:27:14 +0900
Message-ID: <CANatvzzywypKYN_T0mxYNzFs+AniwUt_gWV6WXEJ4oiuYWb1OQ@mail.gmail.com>
To: Alex Rousskov <rousskov@measurement-factory.com>
Cc: ietf-http-wg@w3.org
Hello,

2016-01-09 3:33 GMT+09:00 Alex Rousskov <rousskov@measurement-factory.com>:
> On 01/08/2016 12:17 AM, Kazuho Oku wrote:
>
>> Yesterday, Mark and I have submitted a new draft named "Cache Digests
>> for HTTP/2."
>> https://datatracker.ietf.org/doc/draft-kazuho-h2-cache-digest/
>>
>> The draft proposes a new HTTP/2 frame named CACHE_DIGEST that conveys
>> client's cache state so that a server can determine what should be
>> pushed to the client.
>
>> Please let us know what you think about the proposal.
>
>
> If possible, I recommend removing Draft language that makes (or appears
> to make) your feature specific to optimizing push traffic to user
> agents. Cache digests are useful for many things. Optimizing push
> traffic to user agents is just one use case. For example, Squid proxies
> already use Cache Digests (based on Bloom Filters) to optimize
> cache-to-cache communication in caching hierarchies [1,2].
>
>   [1] http://www.squid-cache.org/CacheDigest/cache-digest-v5.txt
>   [2] http://wiki.squid-cache.org/SquidFaq/CacheDigests
>
> I suspect it is possible to define the new CACHE_DIGEST frame without
> adding artificial restrictions on its use. Let the agents sending and
> receiving that frame decide what use is appropriate between them while
> following some general guidelines.
>
>
> Since there are already two cache digest formats (based on Bloom
> filters and based on Golomb-coded sets), we should expect a third one.
> Have you considered allocating the first few response octets to specify
> the digest format?

Thank you for the suggestion and the info.  I did not know that Squid
was using Bloom filters for that purpose.

If we are to generalize the proposal to support other purposes, such as
exchanging cache states between proxies, I think we should also
consider defining a way to send a digest split across multiple HTTP/2
frames in case its size exceeds 16KB, in addition to providing space to
specify which encoding is being used.

Alternatively, if there is no immediate demand for an encoding other
than Golomb-coded sets for sending small digests, we could add
sentences stating that:

* the sender of a CACHE_DIGEST frame must set its flags to zero
* the receiver must ignore the frame if its flags are not zero

and then, if such demand arises, define new flags to extend the semantics.
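Those two rules might look like this in a receiver (a sketch only; the
`Frame` structure and the frame-type code point are hypothetical, not
taken from any real HTTP/2 implementation):

```python
from dataclasses import dataclass


@dataclass
class Frame:
    # Hypothetical minimal HTTP/2 frame representation.
    type: int
    flags: int
    payload: bytes


CACHE_DIGEST_TYPE = 0xF1  # hypothetical frame-type code point


def handle_cache_digest(frame: Frame):
    # The receiver must ignore the frame if any flag bit is set,
    # leaving room to define new flags and extend the semantics later.
    if frame.flags != 0:
        return None
    return frame.payload
```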

For the purpose of optimizing HTTP/2 push, my assumption has been that
a client implementation is unlikely to want to send a digest that does
not fit into INITCWND (which is why the proposal does not define a way
to send a multi-frame digest).  Also, Golomb-coded sets will likely be
the only practical choice: the digest would become significantly larger
if a Bloom filter were used (with the false-positive rate set to 1/256,
about 8x as large).
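To put rough numbers on that assumption (my own back-of-the-envelope
figures, not from the draft): a Golomb-coded set takes roughly
log2(1/P) + 2 bits per entry, so a digest of even a fairly large cache
fits comfortably within a typical INITCWND of 10 segments:

```python
import math


def gcs_size_bytes(n_entries: int, p_inverse: int) -> int:
    # Rough Golomb-coded set size: ~log2(1/P) + 2 bits per entry,
    # plus two octets for the log2(N) and log2(P) parameters.
    bits_per_entry = math.log2(p_inverse) + 2
    return 2 + math.ceil(n_entries * bits_per_entry / 8)


# 1,000 cached URLs at a false-positive rate of 1/256:
digest_size = gcs_size_bytes(1000, 256)  # about 1.25KB
initcwnd = 10 * 1460                     # 10 full-size TCP segments
```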

>> A CACHE_DIGEST frame can be sent from a client to a server on any
>>    stream in the "open" state, and conveys a digest of the contents of
>>    the cache associated with that stream
>
> Perhaps I am missing some important HTTP/2-derived limits here, but the
> "cache associated with a stream" sounds too vague because HTTP caches
> are often not associated with specific streams. Did you mean something
> like "the cache portion containing shared-origin URIs?"

Yes.

>>    servers ought not
>>    expect frequent updates; instead, if they wish to continue to utilise
>>    the digest, they will need to update it with responses sent to that
>>    client on the connection.
>
> Perhaps I am missing some important HTTP/2 caveats here, but how would
> an origin server identify "that client" when the "connection" is coming
> from a proxy and multiplexes responses to many user agents?

Proxies implementing HTTP/2 ignore extensions that they do not understand.

Therefore, it would not be a problem if a client sent a CACHE_DIGEST
frame to an HTTP/2 proxy that does not understand it; the frame would
simply be ignored.

A proxy that understands the frame can simply forward it to the
upstream server, or it can merge the state of its own cache into the
digest before sending it upstream, to save bandwidth.
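A sketch of that merging step (illustrative only; the function name is
mine, and it assumes the digest has already been decoded back into its
set of hash values, which Golomb-coded sets permit):

```python
import hashlib


def merge_own_urls_into_digest(client_hashes: set, own_urls: list,
                               n: int, p: int) -> set:
    # Hash the proxy's own cached URLs with the client's parameters
    # (step 4 of the draft: sha256(url) modulo N*P) and take the union,
    # so the origin avoids pushing resources cached at either hop.
    merged = set(client_hashes)
    for url in own_urls:
        h = int.from_bytes(hashlib.sha256(url.encode()).digest(), "big")
        merged.add(h % (n * p))
    return merged
```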

>>        1.  Convert "URL" to an ASCII string by percent-encoding as
>>            appropriate [RFC3986].
>
> There are many ways to percent-encode the same URI. This step must
> define a single way for doing so. Besides case insensitive parts and the
> decision of what characters to [un]escape, please do not forget about
> trailing slashes, URI fragments, and other optional parts. This is
> critical for interoperation!

Thank you for pointing that out.

RFC 3986 section 6 defines how to normalize URIs.  I agree that we
should state explicitly which of its normalizations should be applied
before calculating the hash.

IMO sections 6.2.2 (Syntax-Based Normalization) and 6.2.3 (Scheme-Based
Normalization) are the ones that must be applied.
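As a minimal sketch of what applying those normalizations before
hashing might look like (this covers only a subset of 6.2.2/6.2.3; a
full normalizer would also need to handle percent-encoding case and
unreserved characters, and the function is illustrative):

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    # Lowercase the scheme and host, supply the default path, drop the
    # default port for http/https, and discard the fragment entirely.
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname.lower() if parts.hostname else ""
    default_ports = {"http": 80, "https": 443}
    if parts.port in (None, default_ports.get(scheme)):
        netloc = host
    else:
        netloc = "%s:%d" % (host, parts.port)
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))
```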

>>    MUST choose a parameter, "P",
>>    that indicates the probability of a false positive it is willing to
>>    tolerate
>
> For clarity, please detail what you mean by a "false positive" in this
> context. It may also be useful to mention whether the digesting
> algorithm may create false negatives.
>
>
>> 7.  Write log base 2 of "N" and "P" to "digest" as octets.
>
> The wording is ambiguous: Store log2(N) and then store log2(P)? Store
> log2(N&P)? Store log2(N) and then store P? I suspect it is the latter
> and recommend splitting step #7 into two steps, one step per number.

It's log2(N) and then log2(P).

I agree with your recommendation to split the description into two steps.

> BTW, why not store the actual value of N?

To disallow the use of an N that is not a power of two.

As shown in step 4, we need to calculate sha256(url) modulo (N*P) for
every URL.  The calculation is required not only on the client side for
generating the digest, but also on the server side for determining
whether a given URL is likely to be already cached by the client.

If N (in addition to P) is a power of two, the modulo operation can be
a single bit operation.  If not, it will be much more compute-intensive
and might become an attack vector.
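Concretely (a sketch; the helper name is mine), when N*P is a power of
two the reduction in step 4 becomes a single AND:

```python
import hashlib


def hash_bucket(url: str, n: int, p: int) -> int:
    # Step 4 of the draft: sha256(url) reduced modulo N*P.  When both
    # N and P are powers of two, the reduction is one bit-mask instead
    # of an arbitrary-precision division.
    assert (n * p) & (n * p - 1) == 0, "N and P must be powers of two"
    h = int.from_bytes(hashlib.sha256(url.encode()).digest(), "big")
    return h & (n * p - 1)  # equivalent to h % (n * p)
```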

>> 7.  Write log base 2 of "N" and "P" to "digest" as octets.
> ...
>> 8.  Write "R" to "digest" as binary, using log2(P) bits.
>
> It is not clear how a number should be written/encoded. Different
> programming languages and different systems store/represent numbers
> differently, so I would expect the Draft to specify encoding precisely.
> Sorry if I missed that detail.
>
>
> The draft appears to be missing a section documenting how the digest
> recipient can test whether the digest contains a given URI.

Thank you for the advice.  I agree that the description of how to
encode the bits should be more precise (probably like that of HPACK),
and that how to use the digest should be explained.
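To show the level of precision I have in mind, here is one plausible
(illustrative, not normative) bit-level encoding: numbers are written
most-significant bit first, the Golomb-Rice quotient in unary, and the
remainder in exactly log2(P) bits:

```python
class BitWriter:
    # Big-endian bit writer, in the spirit of HPACK's bit-level coding.
    def __init__(self):
        self.bits = []

    def write(self, value: int, nbits: int):
        # Fixed-width field, most-significant bit first.
        for i in reversed(range(nbits)):
            self.bits.append((value >> i) & 1)

    def write_unary(self, q: int):
        # q zero bits terminated by a one bit.
        self.bits.extend([0] * q + [1])

    def to_bytes(self) -> bytes:
        bits = self.bits + [0] * (-len(self.bits) % 8)  # pad to an octet
        return bytes(int("".join(map(str, bits[i:i + 8])), 2)
                     for i in range(0, len(bits), 8))


def encode_gcs(hashes: list, log2_n: int, log2_p: int) -> bytes:
    # Golomb-Rice-code the deltas between the sorted, deduplicated
    # hash values: quotient in unary, remainder in log2(P) fixed bits.
    w = BitWriter()
    w.write(log2_n, 8)  # step 7, split in two: one octet for log2(N)...
    w.write(log2_p, 8)  # ...and one octet for log2(P)
    prev = 0
    for h in sorted(set(hashes)):
        delta = h - prev
        w.write_unary(delta >> log2_p)                 # quotient, unary
        w.write(delta & ((1 << log2_p) - 1), log2_p)   # remainder "R"
        prev = h
    return w.to_bytes()
```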

> Please consider an additional Security Consideration: Origin servers are
> expected to store digests so that the stored digests can be consulted
> when pushing traffic. Most origin servers will store digests in RAM. A
> malicious client may send a huge digest as a form of a DoS attack on a
> naive server that does not validate digest sizes. Malicious client(s)
> may send many small digests as a form of a (D)DoS attack on a naive
> server that does not control the total size of stored digests.

Thank you for bringing it up.  I agree that such a statement should exist.

> Thank you,
>
> Alex.
>



-- 
Kazuho Oku
Received on Saturday, 9 January 2016 06:32:56 UTC
