RE: [navigation-error-logging] new draft proposal from Aaron Heady (BING AVAILABILITY) on 2015-01-13 (public-web-perf@w3.org from January 2015)

From: Aaron Heady (BING AVAILABILITY) <aheady@microsoft.com>
Date: Tue, 13 Jan 2015 19:26:33 +0000
To: Ilya Grigorik <igrigorik@google.com>
CC: public-web-perf <public-web-perf@w3.org>, Domenic Denicola <domenic@google.com>
Message-ID: <BLUPR03MB13316D210EA7CAEFFF8F12ED1400@BLUPR03MB133.namprd03.prod.outlook.com>
Ok, I get the policy idea versus sustained issues, like a misconfigured network. I’ll cede the issue.

My points would then be:

1. Using policy isn’t mutually exclusive of .js access.

2. I don’t understand what NEL .js access is risking/exposing such that it can’t exist.

3. The secure channel problem for HTTP-only hosts. (There are simply cases where having SSL isn’t an option; it’s a cost model thing. HTTP-only servers don’t have to be in secure cages to meet industry compliance requirements for doing things like credit card processing. By restricting a host to HTTP-only, you don’t have to worry about what kind of transactions they spin up on your less-secure* network, so it costs less to operate. You can configure clients that require HTTPS, and thus higher-security infrastructure, on a different network that meets those requirements. This is one reason why the payment portions of sites are on a different hostname than, let’s say, the marketplace portion.)


*less-secure is really a compliance distinction, not an operating system issue. Highly-secure compliance requires things like cages, cameras, two-man rules for maintenance, etc… to protect the integrity of the servers.

Aaron


From: Ilya Grigorik [mailto:igrigorik@google.com]
Sent: Tuesday, January 13, 2015 10:48 AM
To: Aaron Heady (BING AVAILABILITY)
Cc: public-web-perf; Domenic Denicola
Subject: Re: [navigation-error-logging] new draft proposal

On Tue, Jan 13, 2015 at 6:43 AM, Aaron Heady (BING AVAILABILITY) <aheady@microsoft.com> wrote:
I’ll summarize with this: the system isn’t designed to detect full outages; we should see those in normal internal volume-drop telemetry/alerts. And it isn’t designed to detect a single error; there will just be too much noise in the system for any one user or error to matter.

You can make the same argument about performance timing data, yet we provide a complete view and defer how you interpret this data to particular sites and applications, which I believe is the right approach. In some cases a single user can matter; we should not assume that you already have a telemetry system, nor that your system is able to distinguish errors from regular daily dips (e.g. if 1% of users on network X are having problems accessing your site, that data would simply get lost in the noise of your system today; if you don't care about this, OK, but some sites do).

Further, the use of the .js API to send NEL info back to the telemetry origin is predicated on the well-established user behavior of refresh-on-failure. We’ve all done it, and will continue to do it. When a website fails, the vast majority of us shrug it off as ‘the internet’ and retry. If the site we’re accessing is having an intermittent issue, then we’ll get a page load eventually and we can then get the NEL entries. Remember, we don’t need every user to retry, just a sample of them to establish the change in the rate of errors.

I disagree with this. You're focused on an intermittent error where an F5 delivers an immediate fix. There is another class of errors that we don't have visibility into and that this system enables us to see: blocked access, attacks against the user, network misconfiguration, etc. A refresh cannot and does not address that. On top of that, you're assuming that your user will come back some time later, which may be true of a popular destination like Bing, but is not true of the vast majority of long-tail sites: I click on a link, it's dead, I click back and never come back -- no report is ever collected, and the site that failed remains oblivious.

Case in point and an example from my own incompetence: https://github.com/igrigorik/istlsfastyet.com/pull/56

- I added AAAA records but did not configure my server to listen on the ipv6 address
- TMo users were unable to reach istlsfastyet.com ... and *I had no idea for a year+*

You can't detect or resolve the above case with JS reporting. As written, NEL solves this.

As this system is designed to detect intermittent outages, say a buggy DNS deployment, then the information has a very short useful lifespan.

I don't see why you're assuming that we should restrict ourselves to intermittent outages only -- we shouldn't. We have *no visibility* into persistent misconfiguration / attacks / etc. outages and NEL can finally help us detect and act on this.

I’d argue that the data is of little value after 24 hours, I could even say much shorter and still be happy with the system. If we can’t detect a change in the volume of errors in say 6 hours, then the error rate is probably just part of the background noise and nothing is going to detect it. Every issue that I’ve watched in Bing.com that I would want this system to help detect could have tolerated the data on clients being expired in just 5 minutes. The retry rate and the volume of users is what matters, nothing else.

That's a good data point - thanks. In 2.3 I did provide a "SHOULD" for dropping reports after 24 hours.
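
As a rough illustration of the 24-hour drop rule mentioned above, here is a minimal sketch in Python. The report shape (a dict with a `recorded_at` timestamp) and the function name `drop_stale_reports` are assumptions made for this example only; they are not taken from the draft.

```python
from datetime import datetime, timedelta, timezone

# Assumed cutoff, mirroring the 24-hour "SHOULD" discussed in the thread.
MAX_REPORT_AGE = timedelta(hours=24)


def drop_stale_reports(reports, now, max_age=MAX_REPORT_AGE):
    """Return only the reports recorded no more than max_age before now.

    Each report is a hypothetical dict carrying a timezone-aware
    'recorded_at' datetime; anything older than max_age is discarded.
    """
    return [r for r in reports if now - r["recorded_at"] <= max_age]
```

A shorter lifetime, such as the 5 minutes suggested for the Bing.com cases, would only require passing a different `max_age`.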

For the real-time delivery part, ok, I get it. But I don’t like the idea of having millions of clients set to automatically flood me with telemetry when an issue occurs.

This is fair; for a large site you may want to sample. In DR we have a "failure_sample_rate" setting in the config. I can see an argument for providing that as a configuration option: sample=0.1, i.e. deliver 10% of error reports, or some such.
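
To make the sampling idea concrete, here is a minimal sketch in Python of how a client might decide whether to deliver a given error report. The function name `should_deliver` and the injectable `rng` parameter are assumptions for illustration; only the notion of a sample rate such as 0.1 comes from the discussion above.

```python
import random


def should_deliver(sample_rate, rng=random.random):
    """Decide whether a single error report is delivered.

    sample_rate is a fraction in [0, 1]; 0.1 means roughly 10% of
    reports are sent. rng is any callable returning a float in [0, 1),
    injectable so the decision can be tested deterministically.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    return rng() < sample_rate
```

Because each report is sampled independently, the operator can scale the observed error count by 1/sample_rate to estimate the true volume.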

What are some cases that everyone out there is thinking about detecting? In my mind it is all unstable code causing intermittent results/errors of some type. What are your thoughts?

Any network error that prevents the user from successfully loading the resource: transient, permanent, attack, whatever. There is a *long* list of reasons why a user may not be able to reach the site.. our job is to deliver the telemetry of when such a failure occurs and leave it to the operator to figure out how/if they want to act on it.

But one specific concern wasn’t addressed: Secure policy delivery on HTTP-only networks. If example.com is hosted on a CDN’s http-only delivery network, it is literally not possible (by policy, not technology) to provision an SSL certificate on the same host. You can’t have HTTP and HTTPS resolve to two different IPs, so it won’t be possible to have TLS, thus this feature can’t be used. My biggest context for this is experience with Akamai, but it certainly can be generalized. Not arguing against TLS in general, just want to understand how NEL will be used in sites that can’t host a secure channel on the same hostname.

I can't comment on Akamai, but I'm sure they have a solution for this - if not, they should, or they'll be losing business! As I said earlier, I think NEL falls into the "powerful features" bucket and registration needs to be managed over HTTPS.

We *have to* provide non-JS delivery to facilitate real-time + reliable reporting:
(a) a pure JS solution cannot deliver real-time reports since, by definition, the navigation must have succeeded.. it only enables after-the-fact reporting.
(b) after-the-fact reporting requires that the user comes back later and that the load succeeds: this can happen with an arbitrary delay (the user's decision), or not at all - e.g. I click on a search result or link in some article, it fails to load, I never come back and the report is never delivered.
<<Aaron>>  How often do you hit refresh when a page fails? That next load is the opportunity for this data, delayed by just seconds.

I do.. and sometimes it works, but sometimes it doesn't. The fact that sometimes it doesn't is important, and we absolutely need to provide visibility into this case.

(c) any script (including third party) can iterate over your navigation error logs.. which exposes additional private data about the user+their network without (in my opinion) adding much value due to all of the reasons above.
<<Aaron>> This one is interesting. How is this issue different for any of the other arrays of info, navigation timing? Let’s pick this one apart.

You're right, it's not new. I'm just highlighting the pitfalls of exposing the JS API.

ig
Received on Tuesday, 13 January 2015 19:27:06 UTC