Re: Calculation of the hash width in cache-digests

2016-07-18 22:27 GMT+09:00 Martin Thomson <>:
> The draft describes how to select P to achieve a particular false
> positive probability.  The net effect of this is that you get
> approximately the probability.
> I think that this would be better if there was a description of how to
> decide on an upper bound for the false positive probability.
> The calculation is a little less clean:
> n is the number of resources
> p is the lower bound of the probability of false positive
> P is the number of fixed bits allocated to each entry
> N+P is the width of the hash output
> All of these are integers, except p, which is a number between 0 and 1.
> The current draft chooses:
> P = ??(0 - log2(p)) - it's not clear, but I imagine that this uses floor()
> N = round(log2(n))
> I propose that we instead do:
> P = round(0 - log2(p))
> N+P = ceil(log2(n)-log2(p))
> This is, I think, a much better calculation.  We would still benefit
> from encoding N and P separately.

Thank you for the suggestion.

We need to encode N and P, since the two values are mandatory for
correctly decoding the digest. That means that how the values are
encoded must be defined in a normative form, as well as how the hashed
values are encoded.

On the other hand, how N and P should be selected does not
necessarily need to be defined; it is completely up to the client.
Therefore, it might make more sense to describe the selection process
as an example.

That said, I agree that it might make more sense to describe a process
that uses the upper bound of the false-positive rate. However, in that
case we cannot calculate P directly from the false-positive rate,
since a P chosen as the reciprocal of the false-positive rate can end
up half as large as it should be when N is just below a power of two
(in such a case, we would waste approximately one bit per entry).

So, I think we should define the encoding process of digest-value as follows:

In normative form, specify to:

1. decide appropriate N and P, and construct a list of sha256(url)
values, each of them truncated to log2(N*P) bits, and
2. encode N, P, and the listed values using GCS

and, as a suggestion, describe the following algorithm for calculating
the three values in step 1 above:

1. if the url list is empty, set N=0, P=0 and return
2. set NP to round_up_to_power_of_two(num_urls / max_false_positive_rate)
3. create a list that contains sha256(url) of all the URLs, each
truncated to log2(NP) bits
4. sort the list and remove the duplicates
5. set P to round_to_nearest_power_of_two(NP / size-of-the-list)
6. set N to NP / P
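The suggested algorithm above can be sketched in Python roughly as
follows (a sketch, not normative text: the helper names are my own,
and the choice of taking the top bits of the SHA-256 output as the
truncation is an assumption, not something the draft specifies here):

```python
import hashlib
import math

def round_up_to_power_of_two(x):
    # smallest power of two >= x
    return 1 << math.ceil(math.log2(x))

def round_to_nearest_power_of_two(x):
    # power of two closest to x (x >= 1)
    lo = 1 << math.floor(math.log2(x))
    hi = lo * 2
    return lo if x - lo <= hi - x else hi

def select_digest_params(urls, max_false_positive_rate):
    # step 1: empty list => N = 0, P = 0
    if not urls:
        return 0, 0, []
    # step 2: total range NP, rounded up to a power of two
    np_ = round_up_to_power_of_two(len(urls) / max_false_positive_rate)
    bits = int(math.log2(np_))
    # step 3: sha256(url) truncated to log2(NP) bits (top bits, by assumption)
    hashes = [
        int.from_bytes(hashlib.sha256(u.encode()).digest(), "big") >> (256 - bits)
        for u in urls
    ]
    # step 4: sort and remove duplicates
    hashes = sorted(set(hashes))
    # step 5: P derived from the deduplicated count
    p = round_to_nearest_power_of_two(np_ / len(hashes))
    # step 6: N = NP / P
    n = np_ // p
    return n, p, hashes
```

Note that the deduplication in step 4 happens before P is chosen,
which is exactly what makes P track the number of keys actually
encoded rather than the raw URL count.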

An added bonus of not calculating P directly from the false-positive
rate is that i) it emits a more compact GCS for false-positive rates
whose reciprocals are not powers of two (e.g. 1/100), and ii) it emits
a more compact GCS when the false-positive rate is very high (in such
a case, many keys are removed by deduplication, which affects the
ideal value of P).
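For reference, the GCS encoding in step 2 of the normative part could
be illustrated like this (again a sketch under assumptions: P is a
power of two used as the Golomb-Rice parameter, deltas are encoded
without an offset, and the serialization of N and P themselves is
omitted; the draft's actual bit-level layout is whatever it specifies
normatively):

```python
def gcs_encode(sorted_hashes, p):
    """Golomb-Rice encode the deltas between consecutive (sorted,
    deduplicated) hash values with parameter P; returns a bit string
    for illustration (a real encoder would pack bits into bytes)."""
    r_bits = p.bit_length() - 1  # log2(P), since P is a power of two
    out = []
    prev = 0
    for h in sorted_hashes:
        delta = h - prev
        prev = h
        q, r = divmod(delta, p)
        out.append("1" * q + "0")  # quotient in unary, terminated by 0
        if r_bits:
            out.append(format(r, f"0{r_bits}b"))  # remainder in log2(P) bits
    return "".join(out)
```

For example, with P = 8, the sorted values [5, 13] produce deltas
5 and 8, i.e. the bits "0101" followed by "10000".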

Kazuho Oku

Received on Monday, 18 July 2016 16:21:00 UTC