- From: Aryeh Gregor <ayg@aryeh.name>
- Date: Wed, 23 Nov 2011 10:03:52 -0500
- To: Boris Zbarsky <bzbarsky@mit.edu>
- Cc: Ojan Vafai <ojan@chromium.org>, Ian Hickson <ian@hixie.ch>, "Tab Atkins Jr." <jackalmage@gmail.com>, public-webapps@w3.org
On Tue, Nov 22, 2011 at 1:04 PM, Boris Zbarsky <bzbarsky@mit.edu> wrote:
> Again, some decent data on what pages actually do in on* handlers would be
> really good.  I have no idea how to get it.  :(

Can't browsers add instrumentation for this? You have users who have opted in to sending anonymized data. So for each user, on a small percentage of pages, intercept all bare-name property accesses in on*. Record the property name, and which object in the scope chain it wound up resolving to. Send info back to mothership. There will be some perf impact, but it should be no big deal if you only do it a small percentage of the time for each user. Of course, it might require a bunch of work to actually code this kind of thing -- that I'm not in a position to judge.

Moving forward, this kind of info-gathering will be really essential for us to figure out how we can change stuff. Right now we have to be super-conservative when making changes because we have no idea in advance what impact they'll have. This is not a good thing for the web platform, IMO.

(Aside: If we're just looking at some binary question like whether a specific name like "matches" is doable, you should be able to do this even without user opt-in, with no privacy breach. Just send back noise with probability (n - 1)/n, and the real value with probability 1/n, for n fairly large (say 100,000). Then average all the values together, subtract (n - 1)/n times the mean of the distribution you picked the noise values from, multiply by n, and you get something very close to the true average, by the law of large numbers, provided you have enough reports. E.g., if the data is a bit, send a random bit 99.999% of the time and the real value 0.001% of the time. Average all the values, subtract 0.499995 (that is, 0.99999 times the noise mean of 0.5), multiply by 100,000, and you have roughly the true average (error bars easily calculable).

But the bit sent back by any given user would yield negligible information about that user to either the browser vendor or an eavesdropper, because it's almost surely noise. The same approach would work for any value, provided you can come up with a plausible distribution for the noise -- which is almost certainly not the case for string values, say. This would all have to be reviewed by security teams, but it should be doable in principle. The advantage is your sample would actually be representative, which could be important in some cases.)
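For concreteness, the aside's scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not anything a browser ships: the names report_bit and estimate_true_mean are hypothetical, the noise is assumed to be a uniform random bit (mean 0.5), and the demo uses a deliberately small n so that it converges with a modest number of simulated users.

```python
import random


def report_bit(true_bit, n, rng):
    """Client side (hypothetical): send the real bit with probability 1/n,
    otherwise send a uniformly random bit as noise."""
    if rng.random() < 1.0 / n:
        return true_bit
    return rng.randint(0, 1)


def estimate_true_mean(reports, n, noise_mean=0.5):
    """Server side (hypothetical): average the reports, subtract the expected
    noise contribution (n - 1)/n * noise_mean, then multiply by n."""
    average = sum(reports) / len(reports)
    return (average - (n - 1) / n * noise_mean) * n


if __name__ == "__main__":
    rng = random.Random(2011)
    n = 10               # assumption: far smaller than 100,000, to keep the demo cheap
    users = 200_000
    true_rate = 0.3      # assumption: fraction of users whose real bit is 1
    true_bits = [1 if rng.random() < true_rate else 0 for _ in range(users)]
    reports = [report_bit(bit, n, rng) for bit in true_bits]
    print("true mean:     ", sum(true_bits) / users)
    print("estimated mean:", estimate_true_mean(reports, n))
```

On the error bars: each reported bit has standard deviation at most 0.5, so with N reports the averaged value has standard error at most 0.5 / sqrt(N), and multiplying by n scales that to at most n / (2 * sqrt(N)). The larger n is, the more reports are needed before the estimate is usable, so n is a trade-off between privacy and sample size.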
Received on Wednesday, 23 November 2011 15:04:50 UTC