Re: Proposal For Analysis Framework

> On May 22, 2019, at 2:43 PM, Garret Rieger <grieger@google.com> wrote:
> 
> I'd like to get discussion going on how we'd go about analyzing the performance of the various proposed enrichment solutions so I wrote up a short doc proposing one possible way of constructing an analysis:
> 
> https://docs.google.com/document/d/1kx62tpy5hGIbHh6tHMAryon9Sgye--W_IsHTeCMlmEo/edit?usp=sharing
> 
> This is pretty early stages and I would love to hear your thoughts. You can leave comments directly in the doc or reply on this thread. If there's interest I suggest we discuss this further at tomorrow's working group meeting.

Thanks for this! I think this is a really great step in the right direction.

A few questions / thoughts:

In our (the WebKit team’s) experience, optimizing for multiple metrics leads to confusion and unfortunate compromises. Our task in this working group is to determine which (subset) of these solutions we should provide to the Web Platform. To do that, we really need a single fitness function, not three; otherwise, it’s too difficult to make decisions. Compare this to all popular JavaScript benchmarks: each of them measures many diverse and independent things, but the final result is a single scalar. This gives us a strong signal about decisions we’ve made when implementing the engine.

Why is the whole benchmark posed as being per-font at the highest level? Most websites link many web fonts from a single page. Instead of running through the same sequence of pages again and again, paying attention to a different font each time, we should measure a single sequence including all the fonts linked from those pages. This is more accurate and will better reflect the user experience.

I don’t understand the mention of the 95th percentile. Can you explain how you arrived at that number?

Can you talk more about the graph of network requests? What would the graphs look like for the different proposals?

It appears that you’re attempting to model network latency with the formula “estimated network delay = RTT + node.bytes_transferred * delay_per_byte”. Instead of building a model, can’t we gather this data experimentally? We can’t spy on our users to gather random walks, but we can probably do better than purely theoretical modeling. In our experience, experimentally gathered data, even if it only somewhat matches real user behavior, produces better results than a mathematical model that tries to mirror reality.

I don’t understand the “optimal transfer” function. What is it, and what is its value?

The “font families to test with” section seems very Google-Fonts-heavy. Many websites use other font-loading solutions. I think we should be less partial when designing our corpus.

This document doesn’t include the most difficult part of this whole project: coming up with a set of pages that will model user behavior. Do you have any thoughts about how we could gather this corpus? (I have some thoughts about how we could do it, but I’d like to hear from you guys first in case you have a really great solution.)
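
To make the single-fitness-function point concrete, here’s the kind of aggregation I have in mind, in the style of the JavaScript benchmarks mentioned above (JetStream, for example, collapses its subscores with a geometric mean). This is only a sketch; the metric names and values are made up for illustration, not taken from the doc:

```python
from math import prod

def single_score(metrics):
    """Collapse several per-metric results (lower = better) into one
    scalar via a geometric mean, as JS benchmarks like JetStream do.
    The metric names and values used below are illustrative only."""
    values = list(metrics.values())
    return prod(values) ** (1.0 / len(values))

results = {
    "bytes_transferred_ratio": 1.4,  # hypothetical: vs. optimal transfer
    "request_count_ratio": 2.0,      # hypothetical: vs. minimum requests
    "latency_ratio": 1.75,           # hypothetical: vs. ideal delay
}
print(single_score(results))  # one scalar, ≈ 1.70
```

A geometric mean has the nice property that a 2x regression on any one metric moves the final score by the same factor, regardless of the metrics’ absolute scales.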

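On the network-delay question, here is roughly the contrast I mean between the analytical model in the doc and an experimentally derived estimate. The parameter values and sample points below are all hypothetical placeholders:

```python
# Sketch only: the doc's analytical delay model vs. an estimate
# interpolated from measured samples. All parameter values and sample
# points below are hypothetical.

def modeled_delay(bytes_transferred, rtt=0.05, delay_per_byte=2e-7):
    """The doc's formula: RTT + bytes_transferred * delay_per_byte."""
    return rtt + bytes_transferred * delay_per_byte

def empirical_delay(bytes_transferred, samples):
    """Interpolate a delay from measured (bytes, seconds) pairs, e.g.
    gathered from lab transfers over throttled network links."""
    samples = sorted(samples)
    for (b0, t0), (b1, t1) in zip(samples, samples[1:]):
        if b0 <= bytes_transferred <= b1:
            frac = (bytes_transferred - b0) / (b1 - b0)
            return t0 + frac * (t1 - t0)
    # Outside the measured range: clamp to the nearest sample.
    return samples[0][1] if bytes_transferred < samples[0][0] else samples[-1][1]

measured = [(1_000, 0.06), (100_000, 0.09), (1_000_000, 0.35)]
print(modeled_delay(100_000))              # model's estimate (≈ 0.07 s)
print(empirical_delay(100_000, measured))  # measured estimate (0.09 s)
```

The two can disagree substantially, which is exactly why I’d rather ground the benchmark in measurements than in a formula.
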
Thanks,
Myles

Received on Saturday, 25 May 2019 02:28:47 UTC