
Re: Largest Contentful Paint

From: Tim Dresser <tdresser@google.com>
Date: Thu, 18 Apr 2019 11:45:03 -0400
Message-ID: <CAHTsfZAymsFDDMnejW6Djtt8Jb8a5U9Gq+LBkT=SdnQOWfH5Cw@mail.gmail.com>
To: Gilles Dubuc <gilles@wikimedia.org>
Cc: public-web-perf <public-web-perf@w3.org>
Thanks for the thoughts. I agree that the lack of formal research in this
space is unfortunate.

"I find that asking people the same question, e.g. "did this page load fast
enough?", over a very long period of time is actually a good way to check
whether we are getting worse or better, while sidestepping all the issues of
the ever-shifting environment and user expectations."

How does this isolate the signal from the environment and user expectations?
Won't the people you're asking be getting better devices and have changing
expectations?

Tim

On Wed, Apr 17, 2019 at 4:32 PM Gilles Dubuc <gilles@wikimedia.org> wrote:

> Looking at bounces and session length can be deceptive as well, which is
> why we still don't have an "engagement metric" for Wikipedia. Here's why:
> we know that there are very different browsing patterns on Wikipedia. For
> people who come looking for a specific piece of information, a better,
> faster experience would actually result in less time spent on the site and
> higher bounce rates, because they were able to find what they were looking
> for faster. Conversely, when people come to Wikipedia to read for a while
> and fall down the rabbit hole, a shortened visit is a sign of a negative
> experience. As a result, it's difficult to know whether the overall
> bounce rate/session length changing in one direction or another is a good
> thing or a bad thing, because we don't know the underlying intent of the
> user. This is still an active area of research.
>
> As for slowing down performance on purpose: it proves that people notice
> when things get worse, but they might take improvements for granted,
> particularly if they're gradual. There's also the issue of the
> ever-improving environment and the users' evolving expectations. I can
> clearly see a before/after Christmas effect on RUM metrics. People get
> faster devices and stop using slower ones. How does that affect their
> expectation of a site's speed? We can celebrate overall year-to-year RUM
> metric improvements while blissfully ignoring that our performance might
> actually have worsened, with the regression masked by the continuous
> upgrades of the environment (devices + internet speed). I've also been
> working on trying to isolate our responsibility from the environment's. I
> find that asking people the same question, e.g. "did this page load fast
> enough?", over a very long period of time is actually a good way to check
> whether we are getting worse or better, while sidestepping all the issues
> of the ever-shifting environment and user expectations.
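As a sketch of that long-running survey approach: tallying the share of "fast enough" answers per period gives a trend that stays comparable across time. The field names and survey data below are invented for illustration.

```javascript
// Tally the share of "fast enough" answers per period so the trend can be
// tracked over a long time span. Responses here are made-up examples.
function satisfactionByPeriod(responses) {
  const byPeriod = new Map();
  for (const { period, fastEnough } of responses) {
    const t = byPeriod.get(period) || { yes: 0, total: 0 };
    t.total += 1;
    if (fastEnough) t.yes += 1;
    byPeriod.set(period, t);
  }
  // Convert raw tallies into a per-period satisfaction ratio.
  return new Map([...byPeriod].map(([p, t]) => [p, t.yes / t.total]));
}

const responses = [
  { period: '2019-03', fastEnough: true },
  { period: '2019-03', fastEnough: true },
  { period: '2019-03', fastEnough: false },
  { period: '2019-04', fastEnough: true },
  { period: '2019-04', fastEnough: false },
];

const trend = satisfactionByPeriod(responses);
// trend.get('2019-03') === 2/3, trend.get('2019-04') === 0.5
```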
>
> You'll see in our study that the usual concerns about asking people their
> opinion directly didn't have much of an effect in practice (self-selection
> bias, outside factors affecting their mood, disappointing page content).
>
> Before starting this work I looked at all the formal research that had
> been done on performance perception and there's very little, which is why I
> urge people in the field to do more of it. Here's the literature review of
> academic research on the subject that I did last year; it contains findings
> that challenge conventional web-performance wisdom:
> https://www.mediawiki.org/wiki/Wikimedia_Performance_Team/Perceived_Performance
>
> On Wed, Apr 17, 2019 at 9:39 PM Tim Dresser <tdresser@google.com> wrote:
>
>> Thanks for the feedback, and sorry for the delay.
>>
>> I'd love to see that paper!
>>
>> We have done studies correlating metrics with user opinions of speed, but
>> I'm a bit suspicious of them. User-reported speed may not correlate well
>> with the impact of speed on a user, but verifying this is tricky. I think
>> user behavior is a better indicator of performance's impact on user
>> experience. This is hard to evaluate though. We've done some of this
>> research via ablation studies, where we regress browser performance and
>> look at the impact on user behavior, but I don't see a good way to apply
>> that technique to evaluating metrics. We can look for correlation between
>> LCP and attributes of user behavior like abort rate, but it's tricky to
>> separate correlation from causation with that approach.
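The LCP-vs-abort-rate analysis mentioned above can be sketched as a plain Pearson correlation over per-pageview data. All numbers below are invented; the point is that even a strong r can't, by itself, separate correlation from causation.

```javascript
// Pearson correlation between a per-pageview metric and a binary flag
// (a point-biserial correlation). Data is illustrative only.
function pearson(xs, ys) {
  const mean = (a) => a.reduce((s, v) => s + v, 0) / a.length;
  const mx = mean(xs), my = mean(ys);
  let num = 0, dx = 0, dy = 0;
  for (let i = 0; i < xs.length; i++) {
    num += (xs[i] - mx) * (ys[i] - my);
    dx += (xs[i] - mx) ** 2;
    dy += (ys[i] - my) ** 2;
  }
  return num / Math.sqrt(dx * dy);
}

// Hypothetical pageviews: LCP in ms, and whether the visit was aborted.
const lcpMs   = [900, 1200, 2500, 4000, 6000, 800, 3500, 5200];
const aborted = [0,   0,    0,    1,    1,    0,   1,    1];

const r = pearson(lcpMs, aborted);
// A positive r says slower LCP co-occurs with more aborts -- but, as noted
// above, it cannot show that slow LCP *causes* the aborts.
```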
>>
>> Research based on user reported speed is definitely better than nothing
>> though. I'm working to figure out what we can share of our analyses in this
>> space.
>>
>> Re: LCP not being a building block: from my perspective, Element Timing
>> is the building block, and LCP aims to be the prefabricated solution. It
>> won't be as good as what you can build yourself via Element Timing, but it
>> should provide significantly more value than FCP.
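The building-block framing can be made concrete with a sketch: given Element Timing style records, a site can derive its own largest-element metric, which is roughly what the prefabricated LCP entry automates. The record shape below mirrors PerformanceElementTiming (`identifier`, `renderTime`, `size`), but the data is invented; in a browser these entries would come from a PerformanceObserver observing `type: 'element'` on elements annotated with the `elementtiming` attribute.

```javascript
// Hand-rolled "largest rendered element" metric built from Element Timing
// style records. This is the composable building-block version of what the
// prefabricated LCP entry reports automatically.
function largestRendered(entries) {
  // "Largest" by reported size; ties go to the earliest render.
  return entries.reduce(
    (best, e) =>
      !best || e.size > best.size ||
      (e.size === best.size && e.renderTime < best.renderTime) ? e : best,
    null);
}

// Invented entries; a real page would collect these via PerformanceObserver.
const entries = [
  { identifier: 'hero-image', renderTime: 1200, size: 480000 },
  { identifier: 'headline',   renderTime: 300,  size: 12000 },
  { identifier: 'infobox',    renderTime: 900,  size: 90000 },
];

const custom = largestRendered(entries);
// custom.identifier === 'hero-image'
```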
>>
>> Re: User input: we definitely need to put some more thought in here.
>> Thanks for your ideas!
>>
>> Tim
>>
>> On Fri, Apr 12, 2019 at 4:26 PM Gilles Dubuc <gilles@wikimedia.org>
>> wrote:
>>
>>> Some extra (subjective!) feedback on today's presentation. First of all,
>>> I didn't convey that during the call, but thanks for making yet another
>>> attempt at creating a metric that gets closer to the user experience. I
>>> point out the negatives I see, but I'm really happy to see that you're not
>>> giving up on that quest.
>>>
>>> Since I don't want to be the person who only points out issues and
>>> offers no solutions... taking the concept as-is, I think a possible fix
>>> might be to ignore user interaction. But that might pose challenges for
>>> the browser in keeping track of things that are outside of the viewport
>>> after a user scrolls. The API could then report which portion of the
>>> element that was largest before the scroll is still visible at the time
>>> it finishes loading. You could also have two elements reported in that
>>> case: one that was the biggest at the time the user scrolled away, and
>>> another that would have been the biggest if they hadn't scrolled away.
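The "still-visible portion" idea reduces to rectangle intersection. A minimal sketch, with invented coordinates; in a browser the element rectangle would come from something like `getBoundingClientRect()`:

```javascript
// Fraction of an element's area that overlaps the viewport. Rectangles use
// { left, top, width, height }; all coordinates here are illustrative.
function visibleFraction(rect, viewport) {
  const overlapW = Math.max(0,
    Math.min(rect.left + rect.width, viewport.left + viewport.width) -
    Math.max(rect.left, viewport.left));
  const overlapH = Math.max(0,
    Math.min(rect.top + rect.height, viewport.top + viewport.height) -
    Math.max(rect.top, viewport.top));
  return (overlapW * overlapH) / (rect.width * rect.height);
}

// An element that filled the top of the page; the user has scrolled 400px.
const hero = { left: 0, top: 0, width: 800, height: 600 };
const viewportAfterScroll = { left: 0, top: 400, width: 800, height: 600 };

const frac = visibleFraction(hero, viewportAfterScroll);
// frac === 1/3: only the bottom 200px of the 600px-tall element remains
// visible at load completion.
```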
>>>
>>> Looking at the proposal without changes, I think the main weakness of
>>> this metric is precisely that it tries to model the user's psychology
>>> rather than provide a simple building block. I consider most, if not
>>> all, existing performance APIs to surface simple building blocks that
>>> can be reused and composed in different ways; their usefulness usually
>>> goes beyond performance. Making something with a lot of rules baked into
>>> it, blacklisting special cases, takes us away from that "building block"
>>> quality and toward something that has to be taken as a whole. You can't
>>> do much with it besides treating it as a performance score, because it
>>> includes so many special cases that you can't derive composable meaning
>>> from it. That would be fine if we were getting closer to the holy grail:
>>> a metric that provably correlates better with what real users feel.
>>>
>>> But the problem is that this seems to be designed without end-user
>>> (web visitor) input. From a logical perspective, you can look at the
>>> description and think "yes, that seems like something users would care
>>> about". But have you asked them if they do care about it? Do they care
>>> about these aspects of the page load combined this way more than things we
>>> can already capture? Maybe they care more about completely different
>>> aspects of the user experience that are a complete blind spot at the moment?
>>>
>>> If the goal is to produce something that developers will find useful
>>> (users still not involved), then yes, I think it reaches that goal. It
>>> makes sense to an engineer's or product manager's mindset. Analytics
>>> providers can make customers happy by adding
>>> the latest and greatest novelty. But it's a disappointment to me if that's
>>> all we're aiming for.
>>>
>>> In research I've done that will be presented/published next month at The
>>> Web Conference <https://www2019.thewebconf.org/> (I can share the paper
>>> privately with anyone who's interested) I saw that all existing performance
>>> metrics correlate pretty poorly with user opinion about how fast the page
>>> is. We asked users. I think you should too, when coming up with new metrics
>>> like this.
>>>
>>> I'm afraid that if we keep looking at new paint timings in the very
>>> early page-load timeframe, we won't get metrics that correlate any
>>> better with user opinion. I have a lot of digging to do into our Element
>>> Timing for Images data in the next couple of months to answer that very
>>> question for that other API (we're still asking our users about their
>>> performance perception), but I will be able to do it. It would be nice,
>>> in my opinion, if the user were involved very early in the metric
>>> design. The status quo is that we can only verify this much further down
>>> the process, once a form of the metric is already fully implemented in a
>>> browser. And maybe the early design choices were so disconnected from
>>> user perception that in the end we won't get something more valuable
>>> than existing, cruder metrics.
>>>
>>> We might be wasting time and effort cutting this small part of the user
>>> experience (above-the-fold timings in the early loading of the page) into
>>> thinner slices that may well be no closer to user-perceived performance
>>> than existing metrics.
>>>
>>> I'd like to see research showing that users care about this particular
>>> slice of the user experience, to gain more confidence that this is actually
>>> better than something like FCP. I think that the resulting metric would be
>>> more attractive to developers if you could show something like X% of users
>>> in the study were happier with the performance when that particular metric
>>> was lower, all other things being equal, compared with Y% who were
>>> happier when FCP was lower. That would demonstrate that
>>> the metric is measuring something users really perceive that's of higher
>>> importance than existing metrics.
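One way to sketch that comparison: among matched pairs of pageviews that differ mainly in the metric, count how often the user who saw the faster value also reported being happier. The pairing scheme, metric values, and happiness scores below are all invented for illustration.

```javascript
// Among otherwise-comparable pairs of pageviews, what percentage of the
// time did the pageview with the lower (faster) metric value coincide with
// a happier user? Run once per candidate metric (e.g. LCP, then FCP) and
// compare the percentages.
function pctHappierWhenLower(pairs) {
  let happierWithLower = 0;
  for (const { a, b } of pairs) {
    const faster = a.metric < b.metric ? a : b;
    const slower = a.metric < b.metric ? b : a;
    if (faster.happiness > slower.happiness) happierWithLower += 1;
  }
  return (100 * happierWithLower) / pairs.length;
}

// Invented pairs: metric in ms, happiness on a 1-5 survey scale.
const lcpPairs = [
  { a: { metric: 1200, happiness: 4 }, b: { metric: 3000, happiness: 2 } },
  { a: { metric: 2500, happiness: 3 }, b: { metric: 900,  happiness: 5 } },
  { a: { metric: 1800, happiness: 3 }, b: { metric: 2000, happiness: 3 } },
  { a: { metric: 4000, happiness: 1 }, b: { metric: 1000, happiness: 4 } },
];

const pct = pctHappierWhenLower(lcpPairs);
// pct === 75 for this toy data; the same tally over FCP pairs would give
// the Y% to compare against.
```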
>>>
>>
Received on Thursday, 18 April 2019 15:45:41 UTC
