- From: Garret Rieger <grieger@google.com>
- Date: Wed, 29 May 2019 16:13:17 -0700
- To: "Myles C. Maxfield" <mmaxfield@apple.com>
- Cc: "w3c-webfonts-wg (public-webfonts-wg@w3.org)" <public-webfonts-wg@w3.org>
- Message-ID: <CAM=OCWaS_yAS+MK8Cd3CKOXKynKfnx8Wk5DYwV=xwcY28hZ+xw@mail.gmail.com>
Thank you for the detailed feedback Myles! I’ve responded to the individual points inline below:

> - In our (the WebKit team’s) experience, optimizing for multiple metrics leads to confusion and unfortunate compromises made. Our task in this working group is to determine which (subset) of these solutions we should provide to the Web Platform. In order to do that, we really need a single fitness function, not three. Otherwise, it’s too difficult to make decisions. Compare this to all popular JavaScript benchmarks: many of them measure many diverse and independent things, but the final result is a single scalar. This gives us a strong signal about decisions we’ve made when implementing the engine.

Yes, agree with this. However, I don’t think we’re quite yet at the stage where we want to boil things down to a single fitness function. I think we’re still in a somewhat exploratory phase where it will be helpful to see how each solution performs across a few different possible metrics. These explorations could then help inform the development of a single fitness function (which may end up being a combination of the 3 metrics I proposed, perhaps just a single one, and/or possibly something other than what I proposed). If I had to pick a single fitness function, then currently I think it would look something like this:

- Primarily based on the worst-case network delay, with some consideration for total bytes transferred (particularly looking to penalize cases where total bytes transferred is excessively high; this could be modelled by flagging transfers that exceed some % of the optimal transfer case).

> - Why is the whole benchmark posed as being per-font at the highest level? Most websites have many web fonts linked from a single page; Instead of running through the same sequence of pages again and again, and paying attention to a different font each time, we should measure a single sequence including all the fonts linked to from within. This is more accurate and will better reflect the user experience.

My thinking is that even if there are multiple fonts on a page, their loading experiences are largely independent of each other (assuming HTTP/2, the loads of individual families can happen completely in parallel). Therefore we can simplify the analysis by focusing on a single font at a time. To illustrate, let’s assume we’ve got data which has a walk across three pages:

Page 1: uses fonts A and B
Page 2: uses fonts C and B
Page 3: uses fonts A, B, and C

This can be transformed into 3 individual walks by focusing on one font at a time:

Font A: Page 1 -> Page 3 (we only include codepoints in the pages that are styled by font A)
Font B: Page 1 -> Page 2 -> Page 3 (we only include codepoints in the pages that are styled by font B)
Font C: Page 2 -> Page 3 (we only include codepoints in the pages that are styled by font C)

Thus in the overall analysis we’re still accounting for each of the fonts in a somewhat realistic fashion (as long as the above assumption that fonts load largely independently from each other holds).
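To make that transformation concrete, here is a rough sketch (in Python) of how the per-font walks could be derived, assuming we have per-page data about which codepoints are styled by which font. The data shapes and values below are made up purely for illustration; they aren’t taken from the analysis doc:

    from collections import defaultdict

    # Illustration only: each page view is modelled as a mapping from
    # font name to the set of codepoints styled with that font.
    walk = [
        {"A": {0x41, 0x42}, "B": {0x61, 0x62}},   # Page 1: fonts A and B
        {"C": {0x4B, 0x4C}, "B": {0x63}},         # Page 2: fonts C and B
        {"A": {0x43}, "B": {0x64}, "C": {0x4D}},  # Page 3: fonts A, B, and C
    ]

    def per_font_walks(pages):
        """Split one multi-font page walk into independent per-font walks.

        For each font we keep only the pages that use it, and for each of
        those pages only the codepoints styled by that font.
        """
        walks = defaultdict(list)
        for page in pages:
            for font, codepoints in page.items():
                walks[font].append(codepoints)
        return dict(walks)

    # per_font_walks(walk)["A"] == [{0x41, 0x42}, {0x43}]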
> - I don’t understand the mention of the 95th percentile. Can you explain how you came up with that idea?

The idea is that we want to look at the worst case behaviour for a given solution instead of looking at the average (mean or median) performance. If we look at how font loading works today, the first view incurs a heavy cost, while subsequent page views are practically free. If you average this out over many page views things look pretty good. However, the goal of progressive enrichment as I see it is to significantly reduce the cost of that first page view by shifting some of the data loading to subsequent page views. Overall the average number of bytes transferred stays pretty close to the same, but the worst case behaviour is significantly improved. The 95th percentile is just a placeholder number I put out, but the intention is that we want to look at the longest font loads incurred by each solution as a measure of how well it performs. I propose we measure this by looking at the tail of the distribution instead of the mean or median. I’m not completely happy with this approach and we’re still discussing some alternative approaches. Will definitely share more once we have some more concrete details.

> - Can you talk more about the graph of network requests? What would the graphs look like for the different proposals?

For most solutions the graph is pretty simple. For example, for the subset/patch method most page views (for the purpose of this analysis) would only have a single request and response. Where the graph comes into play, I believe, is with your proposed solution. My understanding is that we would first need to fetch the head of the font, which gives us the table of contents for where glyph data is located, and then make one or more subsequent requests to pull in the needed segments of the glyf table. The graph in that case looks like so:

Head load
  |
  +---------> glyf load 1
  +---------> ...
  +---------> glyf load n

All of the glyf loads can happen in parallel, so the graph is the head load with a bunch of edges going out to the individual glyf segment loads. The total network delay for this graph is:

head load time + max(glyf load times)

I’m not super familiar with the specifics of Adobe’s current augmentation service, but I believe it’s pretty similar to the subset/patch case. Finally, for a unicode range based strategy one or more font requests are needed, which can all happen in parallel. So the graph is just a set of unconnected nodes and the total load time is max(individual unicode range segment load times).

> - It appears that you’re attempting to model network latency with the formula “estimated network delay = RTT + node.bytes_transferred * delay_per_byte”. Instead of building a model, can’t we gather this data experimentally? We can’t spy on our users to gather random walks, but we can probably do better than purely theoretical modeling. In our experience, experimentally-gathered data, even if it only somewhat matches real user behavior, provides better results than trying to build a mathematical model that mirrors reality.

Agreed that a more realistic model of network behaviour would be much better. Do you have any ideas on data sets that we could use here?

> - I don’t understand the “optimal transfer” function. What is this, and what is its value?

The idea behind optimal transfer is: “what if we knew exactly what codepoints would be needed for all upcoming page views for a user?” In that case we could cut a subset that includes just those characters and supply it on the first page view. This subset represents the lower bound on the number of bytes that need to be sent to the user for a particular sequence of page views. Practically this is not attainable with any method, but I think it’s valuable to have it as a point of comparison. We can then make assertions like “Method X transfers Y% more bytes than the lower bound.”
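To tie a few of the pieces above together, here is a rough sketch of how the quoted delay formula, the head + glyf request graph, the 95th-percentile tail measurement, and the comparison against the optimal transfer lower bound could be combined for one solution. Everything here is illustrative: the constants, input shapes, and function names are assumptions made up for the example rather than anything from the analysis doc:

    # Assumed values for the simple latency model quoted above:
    #   estimated network delay = RTT + bytes_transferred * delay_per_byte
    RTT = 0.100              # seconds per request round trip
    DELAY_PER_BYTE = 0.5e-6  # seconds per byte transferred

    def request_delay(bytes_transferred):
        return RTT + bytes_transferred * DELAY_PER_BYTE

    def page_delay(head_bytes, glyf_segment_bytes):
        # Head load followed by parallel glyf segment loads:
        #   total delay = head load time + max(glyf load times)
        if not glyf_segment_bytes:
            return request_delay(head_bytes)
        return request_delay(head_bytes) + max(
            request_delay(b) for b in glyf_segment_bytes)

    def percentile(values, pct):
        # Nearest-rank percentile, standing in for the 95th-percentile
        # placeholder discussed above.
        ordered = sorted(values)
        index = min(len(ordered) - 1, round(pct / 100.0 * (len(ordered) - 1)))
        return ordered[index]

    def evaluate(walk_results, optimal_total_bytes):
        # walk_results: one (per-page delays, total bytes transferred) pair
        # per per-font walk; per-page delays could come from page_delay()
        # above. optimal_total_bytes is the lower bound described above
        # (a subset containing exactly the codepoints the walks need).
        worst_page_delays = [max(delays) for delays, _ in walk_results]
        total_bytes = sum(b for _, b in walk_results)
        return {
            "p95_worst_page_delay": percentile(worst_page_delays, 95),
            "overhead_vs_optimal": total_bytes / optimal_total_bytes - 1.0,
        }

    # e.g. evaluate([([0.35, 0.12], 60_000), ([0.41, 0.15, 0.11], 85_000)],
    #               optimal_total_bytes=120_000)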
> - The “font families to test with” section seems very Google-Fonts-heavy. Many websites use many font loading solutions. I think we should be less partial when designing our corpus.

Completely agree. I would love to see fonts or collections of fonts contributed by other WG members added to the corpus of test fonts for the analysis.

> - This document doesn’t include the most difficult part of this whole project: coming up with a set of pages that will model user behavior. Do you have any thoughts about how we could gather this corpus? (I have some thoughts about how we could do it, but I’d like to hear from you guys first in case you have a really great solution.)

I agree that this is probably the more difficult part of the problem. We’re currently trying to chase down some potential data sets internally, but unfortunately I don’t have anything to share on that front yet and it may take a while before I have any updates. As for alternatives, I have some very rough ideas which I haven’t sketched out in any detail:

- Look at character frequencies across an index of the web, and then use those frequencies to create a model which outputs sample code point sets for a hypothetical sequence of page views (a rough sketch of what this could look like is included after this list).
- Generate random page walks across an index of the web and use those. The walks will probably need to be targeted somewhat to ensure they stay confined to a set of pages from the same language.

If you have any ideas of how we could tackle this then that would be helpful :).
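To illustrate the first idea a bit further, here is a rough sketch of what sampling hypothetical per-page codepoint sets from a character frequency table could look like. The frequencies, page sizes, and walk lengths are entirely made-up placeholders; a real model would be derived from an actual index of the web and would need per-language frequency tables:

    import random

    # Hypothetical character frequencies (a real table would come from an
    # index of the web, per language/script).
    char_frequencies = {"e": 0.127, "t": 0.091, "a": 0.082, "o": 0.075, "i": 0.070}

    def sample_page_codepoints(frequencies, chars_per_page=500, rng=random):
        """Sample the set of codepoints one hypothetical page view might use."""
        chars = list(frequencies.keys())
        weights = list(frequencies.values())
        sampled = rng.choices(chars, weights=weights, k=chars_per_page)
        return {ord(c) for c in sampled}

    def sample_walk(frequencies, num_pages=5):
        """A hypothetical sequence of page views for one simulated user."""
        return [sample_page_codepoints(frequencies) for _ in range(num_pages)]

    # e.g. walk = sample_walk(char_frequencies)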
On Fri, May 24, 2019 at 7:28 PM Myles C. Maxfield <mmaxfield@apple.com> wrote:

>
>
> On May 22, 2019, at 2:43 PM, Garret Rieger <grieger@google.com> wrote:
>
> I’d like to get discussion going on how we’d go about analyzing the
> performance of the various proposed enrichment solutions so I wrote up a
> short doc proposing one possible way of constructing an analysis:
>
>
> https://docs.google.com/document/d/1kx62tpy5hGIbHh6tHMAryon9Sgye--W_IsHTeCMlmEo/edit?usp=sharing
>
> This is pretty early stages and I would love to hear your thoughts. You
> can leave comments directly in the doc or reply on this thread. If there’s
> interest I suggest we discuss this further at tomorrow’s working group
> meeting.
>
>
> Thanks for this! I think this is a really great step in the right
> direction.
>
> A few questions / thoughts:
>
>
> - In our (the WebKit team’s) experience, optimizing for multiple
> metrics leads to confusion and unfortunate compromises made. Our task in
> this working group is to determine which (subset) of these solutions we
> should provide to the Web Platform. In order to do that, we really need a
> single fitness function, not three. Otherwise, it’s too difficult to make
> decisions. Compare this to all popular JavaScript benchmarks: many of them
> measure many diverse and independent things, but the final result is a
> single scalar. This gives us a strong signal about decisions we’ve made
> when implementing the engine.
> - Why is the whole benchmark posed as being per-font at the highest
> level? Most websites have many web fonts linked from a single page; Instead
> of running through the same sequence of pages again and again, and paying
> attention to a different font each time, we should measure a single
> sequence including all the fonts linked to from within. This is more
> accurate and will better reflect the user experience.
> - I don’t understand the mention of the 95th percentile. Can you
> explain how you came up with that idea?
> - Can you talk more about the graph of network requests? What would
> the graphs look like for the different proposals?
> - It appears that you’re attempting to model network latency with the
> formula “estimated network delay = RTT + node.bytes_transferred *
> delay_per_byte”. Instead of building a model, can’t we gather this data
> experimentally? We can’t spy on our users to gather random walks, but we
> can probably do better than purely theoretical modeling. In our experience,
> experimentally-gathered data, even if it only somewhat matches real user
> behavior, provides better results than trying to build a mathematical model
> that mirrors reality.
> - I don’t understand the “optimal transfer” function. What is this,
> and what is its value?
> - The “font families to test with” section seems very
> Google-Fonts-heavy. Many websites use many font loading solutions. I think
> we should be less partial when designing our corpus.
> - This document doesn’t include the most difficult part of this whole
> project: coming up with a set of pages that will model user behavior. Do
> you have any thoughts about how we could gather this corpus? (I have some
> thoughts about how we could do it, but I’d like to hear from you guys first
> in case you have a really great solution.)
>
>
> Thanks,
> Myles
>
Received on Wednesday, 29 May 2019 23:13:57 UTC