Re: Proposal For Analysis Framework

Thank you for the detailed feedback, Myles! I’ve responded to the individual
points inline below:


   -

   In our (the WebKit team’s) experience, optimizing for multiple metrics
   leads to confusion and unfortunate compromises made. Our task in this
   working group is to determine which (subset) of these solutions we should
   provide to the Web Platform. In order to do that, we really need a single
   fitness function, not three. Otherwise, it’s too difficult to make
   decisions. Compare this to all popular JavaScript benchmarks: many of them
   measure many diverse and independent things, but the final result is a
   single scalar. This gives us a strong signal about decisions we’ve made
   when implementing the engine.

Yes, agree with this. However, I don’t think we’re quite yet at the stage
where we want to boil things down to a single fitness function. I think
we’re still in a somewhat exploratory phase where it will be helpful to see
how each solution performs across a few different possible metrics. These
explorations could then help inform the development of a single fitness
function (which may end up being a combination of the 3 metrics I proposed,
perhaps just a single one, and/or possibly something other than what I
proposed).

If I had to pick a single fitness function, then currently I think it would
look something like this:

   -

   Primarily based on the worst-case network delay, with some consideration
   for total bytes transferred (in particular, penalizing cases where total
   bytes transferred is excessively high; this could be modelled by flagging
   solutions that exceed some percentage of the optimal transfer case). A
   rough sketch of this is included just below.
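
To make that a bit more concrete, here is a very rough Python sketch of what
such a combined fitness function could look like. All of the names,
thresholds, and weights here are placeholders I made up for illustration,
not part of the proposal:

def fitness_score(worst_case_delay_ms, bytes_transferred, optimal_bytes,
                  excess_threshold=1.5, penalty_weight=0.25):
    # Lower is better. Primarily the worst case network delay, with a
    # penalty once total bytes transferred exceeds some percentage of the
    # optimal transfer case (1.5x and 0.25 are placeholder values).
    score = worst_case_delay_ms
    excess_ratio = bytes_transferred / optimal_bytes
    if excess_ratio > excess_threshold:
        score *= 1 + penalty_weight * (excess_ratio - excess_threshold)
    return score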



   -

   Why is the whole benchmark posed as being per-font at the highest level?
   Most websites have many web fonts linked from a single page; Instead of
   running through the same sequence of pages again and again, and paying
   attention to a different font each time, we should measure a single
   sequence including all the fonts linked to from within. This is more
   accurate and will better reflect the user experience.

My thinking is that even if there are multiple fonts on a page, their
loading experiences are largely independent of each other (assuming HTTP/2,
the loads of individual families can happen completely in parallel).
Therefore we can simplify the analysis by focusing on a single font at a
time.

To illustrate, let’s assume we’ve got data for a walk across three pages:

Page 1: uses fonts A and B

Page 2: uses fonts C and B

Page 3: uses fonts A, B, and C

This can be transformed into 3 individual walks by focusing on one font at
a time:

Font A: Page 1 -> Page 3 (we only include codepoints in the pages that are
styled by font A)

Font B: Page 1 -> Page 2 -> Page 3 (we only include codepoints in the pages
that are styled by font B)

Font C: Page 2 -> Page 3 (we only include codepoints in the pages that are
styled by font C)

Thus in the overall analysis we’re still accounting for each of the fonts
in a somewhat realistic fashion (as long as the above assumption that fonts
load largely independently of each other holds).
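
To make the transformation concrete, here is a rough Python sketch. The
input format (a list of pages, each mapping a font to the codepoints it
styles) is just an assumption for illustration:

from collections import defaultdict

def split_walk_by_font(walk):
    # walk: list of pages, each a dict mapping font name -> set of
    # codepoints styled by that font on the page. Returns a dict mapping
    # font name -> that font's walk (one codepoint set per page view on
    # which the font appears).
    per_font_walks = defaultdict(list)
    for page in walk:
        for font, codepoints in page.items():
            per_font_walks[font].append(codepoints)
    return per_font_walks

walk = [
    {"A": {0x48, 0x69}, "B": {0x21}},           # Page 1: fonts A and B
    {"C": {0x4E2D}, "B": {0x3F}},               # Page 2: fonts C and B
    {"A": {0x4F}, "B": {0x6B}, "C": {0x56FD}},  # Page 3: fonts A, B, and C
]
# split_walk_by_font(walk)["A"] gives font A's walk: Page 1 -> Page 3.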

   -

   I don’t understand the mention of the 95th percentile. Can you explain
   how you came up with that idea?

The idea is that we want to look at the worst case behaviour for a given
solution instead of looking at the average (mean or median) performance. If
we look at how font loading works today, the first view incurs a heavy
cost, while subsequent page views are practically free. If you average this
out over many page views, things look pretty good. However, the goal of
progressive enrichment as I see it is to significantly reduce the cost of
that first page view by shifting some of the data loading to subsequent
page views. Overall the average number of bytes transferred stays pretty
close to the same, but the worst case behaviour is significantly improved.

The 95th percentile is just a placeholder number I put out, but the
intention is that we want to look at the longest font loads incurred by
each solution as a measure of how well they perform. I propose we measure
this by looking at the tail of the distribution instead of the mean or
median. I’m not completely happy with this approach and we’re still
discussing some alternative approaches. Will definitely share more once we
have some more concrete details.
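
As a (simplistic) illustration of what I mean, assuming we had a list of
simulated first-view font load delays for one solution (the numbers below
are made up), the tail measure would be something like:

import statistics

# Hypothetical simulated first-view font load delays (ms) for one solution.
delays_ms = [42, 55, 38, 900, 47, 60, 35, 51, 1100, 44]

mean_delay = statistics.mean(delays_ms)
# The 95th percentile (a placeholder, as noted above) captures the tail /
# worst case behaviour that the mean hides.
p95_delay = statistics.quantiles(delays_ms, n=100)[94]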

   -

   Can you talk more about the graph of network requests? What would the
   graphs look like for the different proposals?

For most solutions the graph is pretty simple. For example, for the
subset/patch method most page views (for the purposes of this analysis)
would only involve a single request and response. Where the graph comes
into play, I believe, is with your proposed solution. My understanding is
that we would first need to fetch the head of the font, which gives us a
table of contents for where the glyph data is located, and then make one or
more subsequent requests to pull down the needed segments of the glyf
table. The graph in that case looks like this:

Head load
 |
 +---------> glyf load 1
 +---------> ...
 +---------> glyf load n

All of the glyf loads can happen in parallel, so the graph is the head load
with a bunch of edges going out to the individual glyf segment loads. The
total network delay for this graph is: head load time + max(glyf segment
load times).

I’m not super familiar with the specifics of Adobe’s current augmentation
service, but I believe it’s pretty similar to the subset/patch case.

Finally, for a unicode range based strategy one or more font requests are
needed, all of which can happen in parallel. So the graph is just a set of
unconnected nodes and the total load time is max(individual unicode range
segment load times).
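
Putting these together, here is a rough sketch of how total network delay
could be computed for each of the graph shapes above, using the simple
per-request model from the doc. The RTT and per-byte constants are
placeholders:

RTT_MS = 100               # placeholder round trip time
DELAY_PER_BYTE_MS = 0.001  # placeholder per-byte transfer cost

def request_delay(bytes_transferred):
    # estimated network delay = RTT + bytes_transferred * delay_per_byte
    return RTT_MS + bytes_transferred * DELAY_PER_BYTE_MS

def patch_subset_delay(response_bytes):
    # Subset/patch (and, I believe, Adobe's augmentation service): a single
    # request/response per page view.
    return request_delay(response_bytes)

def head_plus_glyf_delay(head_bytes, glyf_segment_bytes):
    # Head load followed by parallel glyf segment loads.
    return request_delay(head_bytes) + max(
        request_delay(b) for b in glyf_segment_bytes)

def unicode_range_delay(segment_bytes):
    # Independent, parallel unicode range segment loads.
    return max(request_delay(b) for b in segment_bytes)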


   -

   It appears that you’re attempting to model network latency with the
   formula "estimated network delay = RTT + node.bytes_transferred *
   delay_per_byte”. Instead of building a model, can’t we gather this data
   experimentally? We can’t spy on our users to gather random walks, but we
   can probably do better than purely theoretical modeling. In our experience,
   experimentally-gathered data, even if it only somewhat matches real user
   behavior, provides better results than trying to build a mathematical model
   that mirrors reality.

Agreed that a more realistic model of network behaviour would be much
better. Do you have any ideas on data sets that we could use here?

   -

   I don’t understand the “optimal transfer” function. What is this, and
   what is its value?

The idea behind optimal transfer is: what if we knew exactly which
codepoints would be needed for all upcoming page views for a user? Then we
could cut a subset that includes just those characters and supply it on the
first page view. This subset represents the lower bound on the number of
bytes that need to be sent to the user for a particular sequence of page
views. Practically this is not attainable with any method, but I think it’s
valuable to have as a point of comparison. We can then make assertions like
“Method X transfers Y% more bytes than the lower bound.”
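
As a rough sketch of how the lower bound could be computed (fontTools
subsetting is used here purely for illustration; the exact subsetting
settings, and whether WOFF2 compression is applied, would still need to be
agreed on):

import io
from fontTools import subset
from fontTools.ttLib import TTFont

def optimal_transfer_bytes(font_path, per_page_codepoints):
    # One subset containing every codepoint needed across the whole walk,
    # delivered up front: the lower bound on bytes transferred.
    needed = set().union(*per_page_codepoints)
    font = TTFont(font_path)
    subsetter = subset.Subsetter()
    subsetter.populate(unicodes=needed)
    subsetter.subset(font)
    buffer = io.BytesIO()
    font.save(buffer)
    return len(buffer.getvalue())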

   -

   The “font families to test with” section seems very Google-Fonts-heavy.
   Many websites use many font loading solutions. I think we should be less
   partial when designing our corpus.

Completely agree. I would love to see fonts or collections of fonts
contributed by other WG members added to the corpus of test fonts for the
analysis.

   -

   This document doesn’t include the most difficult part of this whole
   project: coming up with a set of pages that will model user behavior. Do
   you have any thoughts about how we could gather this corpus? (I have some
   thoughts about how we could do it, but I’d like to hear from you guys first
   in case you have a really great solution.)

I agree that this is probably the most difficult part of the problem. We’re
currently trying to chase down some potential data sets internally, but
unfortunately I don’t have anything to share on that front yet and it may
take a while before I have any updates.

As for alternatives, I have some very rough ideas which I haven’t sketched
out in any detail:

   -

   Look at character frequencies across an index of the web, and then use
   those frequencies to create a model which outputs sample code point sets
   for a hypothetical sequence of page views (a rough sketch of this idea is
   included after this list).
   -

   Generate random page walks across an index of the web and use those.
   Probably will need to target the walks somewhat to ensure they stay
   confined to a set of pages from the same language.
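
For the first idea, the sketch I have in mind is roughly the following. The
frequencies and page sizes here are entirely made up; in practice they
would be gathered from an index of the web for a given language:

import random

# Per-character frequencies for a language (placeholder values).
char_frequencies = {"e": 0.127, "t": 0.091, "a": 0.082, "o": 0.075, "i": 0.070}

def sample_page_codepoints(char_frequencies, chars_per_page=500):
    # Sample a synthetic page's worth of characters according to the
    # frequency distribution and return the resulting codepoint set.
    chars = list(char_frequencies)
    weights = list(char_frequencies.values())
    sampled = random.choices(chars, weights=weights, k=chars_per_page)
    return {ord(c) for c in sampled}

# A synthetic walk of three page views' worth of codepoints for one font.
synthetic_walk = [sample_page_codepoints(char_frequencies) for _ in range(3)]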

If you have any ideas for how we could tackle this, that would be very
helpful :).


On Fri, May 24, 2019 at 7:28 PM Myles C. Maxfield <mmaxfield@apple.com>
wrote:

>
>
> On May 22, 2019, at 2:43 PM, Garret Rieger <grieger@google.com> wrote:
>
> I'd like to get discussion going on how we'd go about analyzing the
> performance of the various proposed enrichment solutions so I wrote up a
> short doc proposing one possible way of constructing an analysis:
>
>
> https://docs.google.com/document/d/1kx62tpy5hGIbHh6tHMAryon9Sgye--W_IsHTeCMlmEo/edit?usp=sharing
>
> This is pretty early stages and I would love to hear your thoughts. You
> can leave comments directly in the doc or reply on this thread. If there's
> interest I suggest we discuss this further at tomorrow's working group
> meeting.
>
>
> Thanks for this! I think this is a really great step in the right
> direction.
>
> A few questions / thoughts:
>
>
>    - In our (the WebKit team’s) experience, optimizing for multiple
>    metrics leads to confusion and unfortunate compromises made. Our task in
>    this working group is to determine which (subset) of these solutions we
>    should provide to the Web Platform. In order to do that, we really need a
>    single fitness function, not three. Otherwise, it’s too difficult to make
>    decisions. Compare this to all popular JavaScript benchmarks: many of them
>    measure many diverse and independent things, but the final result is a
>    single scalar. This gives us a strong signal about decisions we’ve made
>    when implementing the engine.
>    - Why is the whole benchmark posed as being per-font at the highest
>    level? Most websites have many web fonts linked from a single page; Instead
>    of running through the same sequence of pages again and again, and paying
>    attention to a different font each time, we should measure a single
>    sequence including all the fonts linked to from within. This is more
>    accurate and will better reflect the user experience.
>    - I don’t understand the mention of the 95th percentile. Can you
>    explain how you came up with that idea?
>    - Can you talk more about the graph of network requests? What would
>    the graphs look like for the different proposals?
>    - It appears that you’re attempting to model network latency with the
>    formula "estimated network delay = RTT + node.bytes_transferred *
>    delay_per_byte”. Instead of building a model, can’t we gather this data
>    experimentally? We can’t spy on our users to gather random walks, but we
>    can probably do better than purely theoretical modeling. In our experience,
>    experimentally-gathered data, even if it only somewhat matches real user
>    behavior, provides better results than trying to build a mathematical model
>    that mirrors reality.
>    - I don’t understand the “optimal transfer” function. What is this,
>    and what is its value?
>    - The “font families to test with” section seems very
>    Google-Fonts-heavy. Many websites use many font loading solutions. I think
>    we should be less partial when designing our corpus.
>    - This document doesn’t include the most difficult part of this whole
>    project: coming up with a set of pages that will model user behavior. Do
>    you have any thoughts about how we could gather this corpus? (I have some
>    thoughts about how we could do it, but I’d like to hear from you guys first
>    in case you have a really great solution.)
>
>
> Thanks,
> Myles
>
