# Re: Calculation of the hash width in cache-digests

```2016-07-18 22:27 GMT+09:00 Martin Thomson <martin.thomson@gmail.com>:
> The draft describes how to select P to achieve a particular false
> positive probability.  The net effect of this is that you get
> approximately the probability.
>
> I think that this would be better if there was a description of how to
> decide on an upper bound for the false positive probability.
>
> The calculation is a little less clean:
>
> n is the number of resources
> p is the lower bound of the probability of false positive
>
> P is the number of fixed bits allocated to each entry
> N+P is the width of the hash output
>
> All of these are integers, except p, which is a number between 0 and 1.
>
> The current draft chooses:
>
> P = ??(0 - log2(p)) - it's not clear, but I imagine that this uses floor()
> N = round(log2(n))
>
> I propose that we instead do:
>
> P = round(0 - log2(p))
> N+P = ceil(log2(n)-log2(p))
>
> This is, I think, a much better calculation.  We would still benefit
> from encoding N and P separately.

Thank you for the suggestion.

We need to encode N and P, since both values are required to decode
the digest correctly. That means that how the values are encoded must
be defined normatively, as must the encoding of the hashed values.

On the other hand, how N and P should be selected does not
necessarily need to be defined; that is entirely up to the client.
Therefore, it might make more sense to describe the selection process
as an example.

That said, I agree that it might make more sense to describe a
process that uses the upper bound of the false-positive rate.
However, in that case we cannot calculate P directly from the
false-positive rate, since a P chosen as the reciprocal of the
false-positive rate ends up half as large as it should be when N is
just below a power of two (in such a case, we would waste
approximately one bit per entry).

So, I think we should define the encoding process of digest-value as follows:

In normative form, specify to:

1. decide on appropriate N and P, and construct a list of
sha256(url) values, each truncated to log2(N*P) bits and deduplicated
2. encode N, P, and the listed values using GCS

and, as a suggestion, describe the following algorithm for
calculating the three values used in step 1 above:

1. if the url list is empty, set N=0, P=0 and return
2. set NP to round_up_to_power_of_two(num_urls / max_false_positive_rate)
3. create a list that contains sha256(url) of all the URLs, each
truncated to log2(NP) bits
4. sort the list and remove the duplicates
5. set P to round_to_nearest_power_of_two(NP / size-of-the-list)
6. set N to NP / P

The added bonus of not calculating P directly from the false-positive
rate is that i) it would emit a more compact GCS for any
false-positive rate that is not a power of two (e.g. 1/100), and ii)
it would emit a more compact GCS when the false-positive rate is very
high (in such a case, many keys would be removed by the dedupe, which
affects the ideal value of P).

--
Kazuho Oku
```
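For concreteness, the two calculations discussed in the quoted message can be sketched in Python as follows. The function names are mine, and `widths_current` assumes `floor()` for P, which is Martin's guess about the draft rather than a confirmed reading:

```python
import math

def widths_current(n, p):
    # Draft's calculation, assuming floor() for P (as guessed in the quoted message).
    # n: number of resources, p: target false-positive probability.
    P = math.floor(0 - math.log2(p))
    N = round(math.log2(n))
    return N, P

def widths_proposed(n, p):
    # Proposed calculation: P rounded to nearest, total width N+P rounded up,
    # so that the total width is at least log2(n) - log2(p).
    P = round(0 - math.log2(p))
    total = math.ceil(math.log2(n) - math.log2(p))
    return total - P, P
```

For example, with n = 100 and p = 1/100, the current calculation yields N = 7, P = 6 (13 bits total), while the proposed one yields N = 7, P = 7 (14 bits), keeping the achieved false-positive probability (roughly n / 2^(N+P)) at or below the target.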
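The suggested selection algorithm (steps 1-6 in the message above) could be sketched in Python roughly as follows. All names are mine, the power-of-two helpers are inlined, and truncation is assumed to keep the top bits of the hash:

```python
import hashlib
import math

def select_n_p(urls, max_false_positive_rate):
    # Step 1: empty URL list => N = 0, P = 0.
    if not urls:
        return 0, 0, []
    # Step 2: NP = round_up_to_power_of_two(num_urls / max_false_positive_rate).
    n_times_p = 1
    while n_times_p < len(urls) / max_false_positive_rate:
        n_times_p *= 2
    bits = n_times_p.bit_length() - 1  # log2(NP)
    # Step 3: sha256(url) truncated to log2(NP) bits (keeping the top bits).
    hashes = [
        int.from_bytes(hashlib.sha256(u.encode()).digest(), "big") >> (256 - bits)
        for u in urls
    ]
    # Step 4: sort the list and remove duplicates.
    keys = sorted(set(hashes))
    # Step 5: P = round_to_nearest_power_of_two(NP / size_of_the_list).
    P = 2 ** round(math.log2(n_times_p / len(keys)))
    # Step 6: N = NP / P.
    return n_times_p // P, P, keys
```

For example, with three distinct URLs and a 1/8 rate, NP rounds up to 32, and with no duplicate keys this gives P = 8 and N = 4.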
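Step 2 of the normative process (encoding the values with GCS) could then be sketched as Golomb-Rice coding of the deltas between consecutive keys. This is my sketch, not the draft's wire format: it assumes P is a power of two of at least 2, treats the first delta as the first key itself, and returns the bits as a string for readability:

```python
def gcs_encode(keys, P):
    # keys: sorted, deduplicated truncated hashes; P: power of two >= 2.
    rice_bits = P.bit_length() - 1  # log2(P), width of the fixed-size remainder
    out = []
    prev = 0
    for key in keys:
        delta = key - prev
        prev = key
        q, r = divmod(delta, P)
        out.append("1" * q + "0")                        # quotient in unary
        out.append(format(r, "0{}b".format(rice_bits)))  # remainder, fixed width
    return "".join(out)
```

For keys [5, 7, 25] and P = 8, the deltas are 5, 2, and 18, each encoded as a unary quotient followed by a 3-bit remainder, giving the bit string 01010010110010.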

Received on Monday, 18 July 2016 16:21:00 UTC