RE: Proposing changes to Navigation Error Logging from Aaron Heady (BING AVAILABILITY) on 2014-07-24 (public-web-perf@w3.org from July 2014)

From: Aaron Heady (BING AVAILABILITY) <aheady@microsoft.com>
Date: Thu, 24 Jul 2014 22:24:05 +0000
To: ttuttle <ttuttle@chromium.org>, "public-web-perf@w3.org" <public-web-perf@w3.org>
Message-ID: <f620d9afa07143a3babdb9e96abb38ed@BN1PR0301MB0705.namprd03.prod.outlook.com>

Added some comments inline.

From: ttuttle@google.com [mailto:ttuttle@google.com] On Behalf Of ttuttle
Sent: Thursday, July 24, 2014 2:35 PM
To: public-web-perf@w3.org
Subject: Proposing changes to Navigation Error Logging

Hi,

I'd like to propose a few changes to the Navigation Error Logging draft spec (https://dvcs.w3.org/hg/webperf/raw-file/tip/specs/NavigationErrorLogging/Overview.html):

1. I'd like to allow more than one error report to be uploaded at once, and allow the browser to delay that upload to collect multiple reports. When a page is failing to load, users will often try multiple times, and it would reduce server load if the error reports could be sent together.

Aaron: When a page fails long enough for the same user to hit the same error several times, the reports are probably identical. I’d be more inclined to allow your “delay to collect more reports”, but then deduplicate identical errors within that window and send a count of how many times each occurred. Still, I really feel that one sample of the error is likely enough.

I’m also not an advocate for the automatic telemetry send, though I’m not outright against it. I’m more interested in JS access to the errors on the next page request (the refresh). Then I can do what you suggested, but with total control over it.
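
A minimal sketch of the dedup-and-count idea above, assuming each queued report is a dictionary with hypothetical "url" and "error" keys (these names are illustrative and are not taken from the draft spec). Identical reports collected during the delay window collapse into one entry carrying a count:

```python
from collections import Counter


def dedup_reports(reports):
    """Collapse identical error reports into one entry each, with a count.

    The "url" and "error" keys are assumed field names for illustration only.
    """
    # Counter keeps first-seen order, so the output order follows the input.
    counts = Counter((r["url"], r["error"]) for r in reports)
    return [
        {"url": url, "error": error, "count": n}
        for (url, error), n in counts.items()
    ]
```

Three identical failures followed by one different failure would then upload as two entries, the first with a count of 3.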

2. Format-wise, to support that, instead of sending a single entry as a JSON dictionary, I'd like to send a dictionary with a single entry called "entries", with an array of entries. (I'm suggesting a dictionary so that future versions of the spec can add additional fields; the server would be expected to ignore unknown keys in the dictionary.)
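
As a rough illustration of that format, the sketch below wraps the reports in a dictionary under "entries" and shows a receiver that reads only that key, so unknown top-level keys from a future version are ignored. The function names and the contents of each entry are assumptions for illustration; only the "entries" key comes from the proposal.

```python
import json


def build_payload(entries):
    """Serialize error reports as a dictionary with a single "entries" array."""
    return json.dumps({"entries": list(entries)})


def parse_payload(body):
    """Return the list of entries, ignoring any unknown top-level keys."""
    data = json.loads(body)
    # Only "entries" is read; any other keys in the dictionary are skipped.
    return data.get("entries", [])
```

A body such as {"entries": [], "future_field": 1} would still parse, with "future_field" simply dropped by the receiver.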

3. I'd like to allow the user-agent to retry the uploads if they fail. If the issue is a transient network issue (e.g. a route is flapping), it's a waste to throw out the error report just because the network was still glitched the first time the upload was attempted.

Aaron: This reads like a denial-of-service attack. We did discuss it originally, but how do you control the retries when an origin has a short-lived but widespread spike in errors, especially when the origin for the error is also the logging endpoint for these navigation error calls? A few seconds after it recovers, it gets hit with a global surge of telemetry requests, knocking it offline and causing more errors, and so on. This also goes back to #1: any error that is stable enough to repro is going to be reported by a large number of users. I expect this system to be lossy telemetry-wise: optimized to protect the origin, not the error telemetry. And if you wait for the next successful page load, then you can get the errors from the queue.
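
One possible way to reconcile retries with the surge concern, sketched under assumptions that are not part of either proposal: cap the number of retries, spread them out with randomized (jittered) exponential delays so clients do not all return at once, and drop the report when the attempts run out, which keeps the system lossy. All names, defaults, and the backoff scheme here are hypothetical.

```python
import random


def retry_delays(max_retries=3, base=1.0, cap=60.0, rng=random.random):
    """Yield one jittered delay (in seconds) per allowed retry."""
    for attempt in range(max_retries):
        # The ceiling doubles with each attempt but never exceeds the cap.
        ceiling = min(cap, base * 2 ** attempt)
        # Full jitter: pick a delay between 0 and the ceiling.
        yield rng() * ceiling


def upload_with_retries(send, report, max_retries=3, sleep=lambda s: None):
    """Try one upload plus a bounded number of retries.

    `send` is a caller-supplied function returning True on success.
    `sleep` is a no-op by default; a real caller would pass time.sleep.
    Returns False when every attempt fails, meaning the report is dropped.
    """
    if send(report):
        return True
    for delay in retry_delays(max_retries):
        sleep(delay)
        if send(report):
            return True
    return False
```

With the defaults, a report that never uploads is attempted four times in total and then discarded, so a recovering origin sees a bounded, spread-out load rather than an open-ended retry storm.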

4. I'd like to figure out a way to support logging errors involving requests that were not top-level navigations. There are plenty of other things that can fail to load that the site owner might not necessarily have control over. (For example, Facebook might want to know when parts of their social plugin fail to load, even if they are not hosted on a site where Facebook can add an error handler.)

Aaron: I completely agree, and this was one of my largest goals with this spec. But it basically became a CORS problem, and we agreed that it likely wasn’t going to be solved in this first round. So as not to delay getting top-level errors and a foothold on the problem, we went ahead without CORS-level errors. I really hope we can change that in the long run. It is very compelling. I’d love to discuss that and sort it out.

Thoughts?

Thanks,

ttuttle

Received on Thursday, 24 July 2014 22:24:38 UTC