RE: [navigation-error-logging] new draft proposal from Aaron Heady (BING AVAILABILITY) on 2015-01-13 (public-web-perf@w3.org from January 2015)

From: Aaron Heady (BING AVAILABILITY) <aheady@microsoft.com>
Date: Tue, 13 Jan 2015 14:43:49 +0000
To: Ilya Grigorik <igrigorik@google.com>
CC: public-web-perf <public-web-perf@w3.org>, Domenic Denicola <domenic@google.com>
Message-ID: <BLUPR03MB13371D20FFCB87A113C7302D1400@BLUPR03MB133.namprd03.prod.outlook.com>
Some comments inline. To summarize: this system isn’t designed to detect full outages; we should see those in normal internal volume-drop telemetry/alerts. And it isn’t designed to detect a single error; there will just be too much noise in the system for any one user or error to matter. The scenarios that this system will really enable are detecting changes in error rates across the set of users, at best in a region (country, office, version of software, some subset of the site). Further, the use of the .js API to send NEL info back to the telemetry origin is predicated on the well-established user behavior of refresh-on-failure. We’ve all done it, and will continue to do it. When a website fails, the vast majority of us shrug it off as ‘the internet’ and retry. If the site we’re accessing is having an intermittent issue, then we’ll get a page load eventually and we can then get the NEL entries. Remember, we don’t need every user to retry, just a sample of them to establish the change in rate of errors.

As this system is designed to detect intermittent outages, say a buggy DNS deployment, then the information has a very short useful lifespan. I’d argue that the data is of little value after 24 hours, I could even say much shorter and still be happy with the system. If we can’t detect a change in the volume of errors in say 6 hours, then the error rate is probably just part of the background noise and nothing is going to detect it. Every issue that I’ve watched in Bing.com that I would want this system to help detect could have tolerated the data on clients being expired in just 5 minutes. The retry rate and the volume of users is what matters, nothing else.
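To make the "change in error rate across a set of users" idea concrete, here is a minimal sketch. The report shape (region, timestamp), the 5-minute expiry, and the `factor` and `min_count` thresholds are all assumptions chosen for illustration; none of them come from the NEL draft.

```python
from collections import Counter

def error_rates(reports, now, max_age_s=300):
    """Count recent error reports per region, ignoring entries older than max_age_s."""
    counts = Counter()
    for region, timestamp in reports:
        if now - timestamp <= max_age_s:
            counts[region] += 1
    return counts

def regions_with_rate_change(baseline, current, factor=3.0, min_count=5):
    """Return regions whose current count is at least `factor` times the baseline."""
    flagged = []
    for region, count in current.items():
        expected = max(baseline.get(region, 0), 1)  # avoid dividing by zero
        if count >= min_count and count / expected >= factor:
            flagged.append(region)
    return sorted(flagged)

# Hypothetical sample data: (region, seconds-since-epoch) pairs.
baseline = Counter({"US": 4, "DE": 3})
reports = (
    [("US", 1000 + i) for i in range(5)]
    + [("DE", 1000 + i) for i in range(20)]
    + [("DE", 10)]  # stale entry, dropped by the expiry check
)
current = error_rates(reports, now=1100)
print(regions_with_rate_change(baseline, current))  # ['DE']
```

No single report matters here; only the per-region counts relative to the baseline do, which is why lossy, short-lived client data would still be usable.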


For the real-time delivery part, ok, I get it. But I don’t like the idea of having millions of clients set to automatically flood me with telemetry when an issue occurs. I’d rather ask them for it, client-side. What’s the case: what type of issue has no retry/refresh rate that would allow for client-side processing? I could say something like peering for TWC<>MS was broken during network maintenance. TWC customers cannot reach the MS IP space at all. So refresh isn’t going to work. But neither is sending telemetry to the MS-operated telemetry endpoint. We should pick this problem up at the network management level.

What are some cases that everyone out there is thinking about detecting? In my mind it is all unstable code causing intermittent results/errors of some type. What are your thoughts?

But one specific concern wasn’t addressed: Secure policy delivery on HTTP-only networks. If example.com is hosted on a CDN’s http-only delivery network, it is literally not possible (by policy, not technology) to provision an SSL certificate on the same host. You can’t have HTTP and HTTPS resolve to two different IPs, so it won’t be possible to have TLS, thus this feature can’t be used. My biggest context for this is experience with Akamai, but it certainly can be generalized. Not arguing against TLS in general, just want to understand how NEL will be used in sites that can’t host a secure channel on the same hostname.


Aaron





From: Ilya Grigorik [mailto:igrigorik@google.com]
Sent: Monday, January 12, 2015 4:41 PM
To: Aaron Heady (BING AVAILABILITY)
Cc: public-web-perf; Domenic Denicola
Subject: Re: [navigation-error-logging] new draft proposal

Hi Aaron, thanks for the feedback! Inline..

On Mon, Jan 12, 2015 at 2:22 PM, Aaron Heady (BING AVAILABILITY) <aheady@microsoft.com> wrote:
I’ll preface this with: I’m not a fan of the Delivery Policy idea, I prefer that the NEL details are in an array like the performance timing entries are and just accessed via client side script.

That said, removing the .js API is really bad. I don’t want .js based registration of the Delivery Policy, I just want .js based access to the NavigationErrorLog object array so I can read it client side at will. How is this different than accessing performance timing info via .js? We can also have policy via headers, but that leads to these questions.

We *have to* provide non-JS delivery to facilitate real-time + reliable reporting:
(a) pure JS solution cannot deliver real-time reports since, by definition, the navigation must have succeeded.. it only enables after-the-fact reporting.
(b) after-the-fact reporting requires that the user comes back later and that load succeeds: this can happen with an arbitrary delay (users decision), or not at all - e.g. I click on a search result or link in some article, it fails to load, I never come back and report is never delivered.
<<Aaron>>  How often do you hit refresh when page fails, that next load is the opportunity for this data, delayed by just seconds.


As a result, I think JS API is at best of very limited value. In practice you'd want the UA to deliver the reports in the background on as-it-happens basis.

Further, adding a JS API exposes new complications:

(a) it's not clear how long the UA should retain these navigation error logs for? This could add a lot of overhead if user is experiencing poor connectivity and is hitting a lot of errors.
<<Aaron>> Don’t think we need every error. I’d even say a small (5 entry) queue would be enough. I think we only need a sample of the most recent errors and if any one user cycles through the 5 error queue and discards the current ‘real’ error then it’s okay because the other users will relay the right information. No one user matters. I expect the data on the client to be super-lossy, because we don’t need high fidelity data.


(b) we're back to the same problem of shared buffer and races between various scripts: either you have to diff the nav error logs and avoid clearing the buffers (overhead), or you clear but run the risk of other subscribers missing items (granted, this is not a new issue...)
<<Aaron>> same as above, I don’t care about loss of NEL entries. The aggregate signal from all users will be accurate, they will all lose different parts of the data.

(c) any script (including third party) can iterate over your navigation error logs.. which exposes additional private data about the user+their network without (in my opinion) adding much value due to all of the reasons above.
<<Aaron>> This one is interesting. How is this issue different for any of the other arrays of info, navigation timing? Let’s pick this one apart.
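The small, lossy queue suggested in the replies to (a) and (b) above could look roughly like the following sketch. The class name `ErrorQueue`, its methods, and the error strings are hypothetical; the only detail taken from the thread is the 5-entry capacity and the idea that older entries are silently discarded.

```python
from collections import deque

class ErrorQueue:
    """Bounded, lossy buffer: once full, the oldest entry is dropped on each append."""

    def __init__(self, capacity=5):
        self._entries = deque(maxlen=capacity)

    def record(self, error_type):
        self._entries.append(error_type)

    def drain(self):
        """Return the retained entries and clear the buffer."""
        entries = list(self._entries)
        self._entries.clear()
        return entries

queue = ErrorQueue()
for n in range(8):
    queue.record(f"dns.failure.{n}")  # hypothetical error labels
print(queue.drain())  # only the 5 most recent entries survive
```

A fixed capacity bounds the storage overhead raised in (a), at the cost of per-user fidelity, which the reply argues is acceptable because the aggregate across users carries the signal.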

... that said, I'm open to being convinced otherwise. ATM I just don't see any real-world deployments actually using the JS API for all of the reasons above, plus the additional privacy complications for the user (note that CSP does not expose access to error reports for same reasons -- consistency with other APIs is another argument).

The server delivers the NEL policy to the user agent via an HTTP response header field (NEL header field). The policy MUST be delivered over a secure transport. If the policy is delivered over a secure transport with no underlying secure transport errors or warnings, and its format conforms to the specified grammar, the user agent MUST either:

For above, Policy Delivery and Processing: By requiring secure delivery you have the added burden of setting up a secure channel just to deliver the policy on a normal HTTP page. This will have to execute as a background resource on every response we serve because we won’t know if the client has the delivery policy directive already, unless we add a cookie that tracks the expiration date, etc… Should there be a request header indicating NEL enrollment so that everyone doesn’t have to roll their own tracking mechanism? Maybe NEL-max-age: 360 from the client would say I’m enrolled and have 360 seconds left until it expires. (bad header name, but you get the idea)

For, NEL request header: the client would have to send it on every request to a known NEL host, which is no different (modulo a few header bytes) from the server always appending the NEL policy response header. Adding the "time until expires" is also another form of a cookie, which is something I'd like to avoid. I'm not convinced we need this.

But probably a bigger issue: This also skips the fact that some CDNs have clients (domains) set up on an HTTP-only network where DNS resolves to a server that can’t/won’t host SSL; SSL/TLS is on a different IP block. Thus if you can’t get a certificate tied to your domain, you can’t issue policy for that domain. If I’m on an HTTP-only CDN network for example.com, how do I get a policy issued to that domain via a secure connection?

I think NEL qualifies as a "powerful feature" [1], hence HTTPS registration is required - e.g. we don't want a MITM/downgrade attack to be able to hijack your error reports, hence HTTPS-only registration. That said, note that once registered the policy would apply to both HTTPS and HTTP schemes for that origin -- e.g. an HTTP site can make a background HTTPS request (XHR, iframe, etc) to register the policy; you don't have to be HTTPS-only to take advantage of NEL, you only need HTTPS to manage the registration.

[1] https://w3c.github.io/webappsec/specs/powerfulfeatures/#is-feature-powerful



2.1.1.3 The includeSubDomains Directive
The OPTIONAL includeSubDomains directive is a valueless directive that, if present, signals the user agent that the NEL policy applies to this NEL host as well as any subdomains of the host's domain name.
For includeSubDomains, what’s the consequence if issued on foo.example.com? It should then work for *.foo.example.com, but if subsequently example.com issues its own includeSubDomains, does that overwrite foo.example.com, thus *.foo also?

I believe (2.2) should address this: "The user agent must maintain the NEL policy of any given NEL host separately from any NEL policies issued by any other NEL hosts whose domain names are superdomains or subdomains of the given NEL host's domain name. Only the given NEL host can update or cause deletion of its NEL policy"... same logic as HSTS.
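A rough sketch of how a user agent might keep per-host policies separate while honoring HTTPS-only registration follows. `PolicyStore` and its methods are hypothetical names, and the lookup order (exact host first, then the nearest superdomain that set includeSubDomains) is an assumption for illustration; the thread only states that policies are stored separately per host and that registration requires HTTPS.

```python
class PolicyStore:
    """Hypothetical per-host NEL policy store, loosely modeled on HSTS bookkeeping."""

    def __init__(self):
        self._policies = {}  # host -> policy dict

    def register(self, scheme, host, report_uri, include_subdomains=False):
        """Store a policy only if it arrived over HTTPS; return whether it was accepted."""
        if scheme != "https":
            return False
        self._policies[host] = {
            "host": host,
            "report_uri": report_uri,
            "include_subdomains": include_subdomains,
        }
        return True

    def lookup(self, host):
        """Find the policy for a host: exact match, else nearest includeSubDomains parent."""
        if host in self._policies:
            return self._policies[host]
        labels = host.split(".")
        for i in range(1, len(labels)):
            parent = ".".join(labels[i:])
            policy = self._policies.get(parent)
            if policy and policy["include_subdomains"]:
                return policy
        return None

store = PolicyStore()
store.register("https", "foo.example.com", "https://t.example/foo", include_subdomains=True)
store.register("https", "example.com", "https://t.example/root", include_subdomains=True)
print(store.lookup("a.foo.example.com")["host"])  # foo.example.com
print(store.lookup("bar.example.com")["host"])    # example.com
```

Under these assumptions, a later registration by example.com does not overwrite the entry for foo.example.com, because each host's policy lives under its own key.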

Each report URI in the provided set of report URIs MUST use a secure transport to receive the NEL reports. If any of the provided report URI's does not use a secure transport, the user agent MUST ignore the provided policy. The process of sending navigation error reports to the specified URI's in this directive's value is defined in this documents 2.3 Reporting section.

If the original user navigation, with all of the potential personal payload, doesn’t have to be secure, why does NEL telemetry have to be secure? Mind you, I like TLS and want to secure things. Just wondering why it is being dictated in this scenario. It also seems like it’s going to drive prices up on telemetry monitoring endpoints, 3rd party or in house.

See: https://w3c.github.io/webappsec/specs/powerfulfeatures/


The REQUIRED report-uri directive specifies a URI to which the user agent sends reports about navigation errors. The ABNF grammar for the name and value of the directive is:

The REQUIRED max-age directive specifies the number of seconds, after the reception of the NEL header field, during which the user agent regards the host (from whom the
Since both report-uri and max-age are required, what if we are just disabling the policy by setting max-age to 0? Will not having a report-uri header cause the request to be invalid, along the lines of the “MUST ignore ….that does not conform” comments earlier in the doc? Should uri only be required if max-age is present and >0?

Right, good catch. This was an editorial shortcut for me, I think we should allow the simple "NEL: max-age=0" as a valid unregister policy.

ig
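The max-age=0 exchange above can be illustrated with a small parsing sketch. The thread does not reproduce the ABNF grammar, so the syntax assumed here (semicolon-separated directives, optionally quoted values, as in HSTS) and the function name `parse_nel_header` are guesses for illustration only. The rules it encodes are the ones discussed: max-age is required, max-age=0 alone unregisters, and a non-secure report-uri causes the policy to be ignored.

```python
def parse_nel_header(value):
    """Parse an assumed HSTS-style NEL header value; return None if it must be ignored."""
    directives = {}
    for part in value.split(";"):
        part = part.strip()
        if not part:
            continue
        name, _, raw = part.partition("=")
        directives[name.strip().lower()] = raw.strip().strip('"')

    max_age = directives.get("max-age", "")
    if not max_age.isdecimal():
        return None  # max-age is required and must be a non-negative integer
    if int(max_age) == 0:
        return {"action": "unregister"}  # report-uri not needed to unregister

    report_uri = directives.get("report-uri", "")
    if not report_uri.startswith("https://"):
        return None  # missing or non-secure report URI: ignore the policy
    return {
        "action": "register",
        "max_age": int(max_age),
        "report_uri": report_uri,
        "include_subdomains": "includesubdomains" in directives,
    }

print(parse_nel_header("max-age=0"))
print(parse_nel_header('report-uri="https://t.example/nel"; max-age=3600; includeSubDomains'))
```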



From: Ilya Grigorik [mailto:igrigorik@google.com]
Sent: Monday, January 12, 2015 12:52 PM
To: public-web-perf
Cc: Domenic Denicola
Subject: [navigation-error-logging] new draft proposal

We identified a number of issues with the current NEL draft at TPAC:

1) JS-based registration can be easily hijacked
2) Ability to aggregate multiple errors into a single report
3) Desire for more extensive error coverage and better delivery model
... more: https://github.com/w3c/navigation-error-logging/issues


In an attempt to address all of the above, I have a new draft proposal which is based on our experience with Domain Reliability [1], and also reuses a lot of the concepts from CSP and HSTS:

https://cdn.rawgit.com/w3c/navigation-error-logging/new/index.html


- HSTS-like header based registration
- CSP-like error reporting for failed navigations
-- JS interface is removed entirely for security and privacy reasons, same as CSP
- Domain Reliability-like error types and report structure and delivery

In short, it *is* a significant departure from the current draft, but I do believe that it addresses all the major open issues and provides a consistent interface to similar APIs (e.g. CSP).

Would love to hear any thoughts or feedback!

ig

[1] https://docs.google.com/a/chromium.org/document/d/14U0YA4dlzNYciq2ke0StEMjomdBUN6ocSt1kN03HJ0s/edit?pli=1
Received on Tuesday, 13 January 2015 14:44:25 UTC