- From: Kazuho Oku <kazuhooku@gmail.com>
- Date: Tue, 19 Jul 2016 01:20:26 +0900
- To: Martin Thomson <martin.thomson@gmail.com>
- Cc: HTTP Working Group <ietf-http-wg@w3.org>
2016-07-18 22:27 GMT+09:00 Martin Thomson <martin.thomson@gmail.com>:
> The draft describes how to select P to achieve a particular false
> positive probability. The net effect of this is that you get
> approximately that probability.
>
> I think that this would be better if there were a description of how
> to decide on an upper bound for the false positive probability.
>
> The calculation is a little less clean:
>
>   n is the number of resources
>   p is the upper bound of the probability of false positive
>
>   P is the number of fixed bits allocated to each entry
>   N+P is the width of the hash output
>
> All of these are integers, except p, which is a number between 0 and 1.
>
> The current draft chooses:
>
>   P = ??(0 - log2(p)) - it's not clear, but I imagine that this uses floor()
>   N = round(log2(n))
>
> I propose that we instead do:
>
>   P = round(0 - log2(p))
>   N+P = ceil(log2(n) - log2(p))
>
> This is, I think, a much better calculation. We would still benefit
> from encoding N and P separately.

Thank you for the suggestion.

We need to encode N and P, since the two values are mandatory for
correctly decoding the digest. That means that how the values are
encoded must be defined in normative form, as must the encoding of the
hashed values. On the other hand, how N and P should be selected does
not necessarily need to be defined; that is entirely up to the client.
Therefore, it might make more sense to define the selection process as
an example.

That said, I agree that it might make more sense to describe a process
that uses an upper bound on the false-positive rate. However, in that
case we cannot calculate P directly from the false-positive rate, since
a P chosen as the reciprocal of the false-positive rate becomes half as
large as it should be when N falls just below a power of two (in such a
case, we would waste approximately 1 bit per entry).

So, I think we should define the encoding process of digest-value as
follows.

In normative form, specify to:

 1. decide appropriate values of N and P, and construct a list of
    sha256(url) values, each truncated to log2(N*P) bits and deduped
 2. encode N, P, and the listed values using GCS

and, as a suggestion, describe the following algorithm for calculating
the three outputs of step 1 above (a sketch follows at the end of this
message):

 1. if the URL list is empty, set N=0, P=0 and return
 2. set NP to round_up_to_power_of_two(num_urls / max_false_positive_rate)
 3. create a list containing sha256(url) of all the URLs, each
    truncated to log2(NP) bits
 4. sort the list and remove the duplicates
 5. set P to round_to_nearest_power_of_two(NP / size_of_the_list)
 6. set N to NP / P

The added bonuses of not calculating P directly from the false-positive
rate are that i) it would emit a more compact GCS for any false-positive
rate that is not a power of two (e.g. 1/100), and ii) it would emit a
more compact GCS when the false-positive rate is very high (in that
case, many keys would be removed by the dedupe, which affects the ideal
value of P).

-- 
Kazuho Oku
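[A minimal sketch, in Python, of the suggested selection algorithm
above, for illustration only; it is not part of the draft. The function
name select_digest_params and the helper names are hypothetical.]

    import hashlib
    import math

    def round_up_to_power_of_two(x):
        # smallest power of two >= x (for x >= 1)
        return 1 << math.ceil(math.log2(x))

    def round_to_nearest_power_of_two(x):
        # power of two nearest to x in log space (for x >= 1)
        return 1 << round(math.log2(x))

    def select_digest_params(urls, max_false_positive_rate):
        # Step 1: the empty case.
        if not urls:
            return 0, 0, []
        # Step 2: size of the hash space (NP), rounded up to a power
        # of two.
        np_total = round_up_to_power_of_two(
            len(urls) / max_false_positive_rate)
        bits = np_total.bit_length() - 1   # log2(NP)
        # Step 3: sha256(url), truncated to the top log2(NP) bits.
        keys = [int.from_bytes(hashlib.sha256(u.encode()).digest(),
                               'big') >> (256 - bits)
                for u in urls]
        # Step 4: sort and dedupe.
        keys = sorted(set(keys))
        # Step 5: choose P from the *deduped* count, so the
        # Golomb-Rice parameter matches the actual key density.
        P = round_to_nearest_power_of_two(np_total / len(keys))
        # Step 6: N is the remaining factor.
        N = np_total // P
        return N, P, keys

[For example, with numbers chosen purely for illustration: 130 URLs and
max_false_positive_rate = 1/64 give NP = 16384, and, assuming no
duplicate truncated hashes, P = round_to_nearest_power_of_two(16384 /
130) = 128 and N = 128, whereas taking P directly as the reciprocal of
the rate would have fixed P at 64.]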
Received on Monday, 18 July 2016 16:21:00 UTC