RE: [ErrorLogging] Draft Specification from Aaron Heady (BING) on 2013-04-02 (public-web-perf@w3.org from April 2013)

From: Aaron Heady (BING) <aheady@microsoft.com>
Date: Tue, 2 Apr 2013 17:20:11 +0000
To: Ilya Grigorik <igrigorik@google.com>, "Reitbauer, Alois" <Alois.Reitbauer@compuware.com>
CC: Jatinder Mann <jmann@microsoft.com>, "Austin,Daniel" <daaustin@paypal-inc.com>, "public-web-perf@w3.org" <public-web-perf@w3.org>
Message-ID: <d106fa84277c47ee97f0c4f02abfb59c@BLUPR03MB067.namprd03.prod.outlook.com>
Hello,

Just for some context about my view on how this feature could operate: I've worked as the senior live site service engineer for Microsoft's Bing search engine for the last 5 years. During that time I've been on-call as an incident manager for all Bing services, and my team is specifically responsible for operating all of our edge services, whether MS owned or third party.

We spend a lot of effort to monitor the health of our service, as I'm sure we all do, and over the last year we have been particularly focused on extending that monitoring into every ISP/AS around the world so that we can detect intermittent connectivity issues, flaky services that only manifest under certain circumstances, localized service disruptions due to infrastructure issues and the like. We see this draft feature as the ultimate end user monitoring solution, once it gets rolled out to several browsers.

There have been a couple of key questions brought up so far, in no particular order:

1. Is getting the data from the subsequent page load real-time enough?

2. Could we use a preloaded script to get real-time?

3. Monitoring data will be sent to a different infrastructure than your own.


I'm going to start with #3 because it impacts #1 & #2. The assumption that the logging infrastructure (for monitoring data) is substantially separate from the service it is reporting on is true in many cases, perhaps even ideal. But it is very likely that many services use at least an overlapping piece of network hardware between their service and logging infrastructure. Using third party services as telemetry endpoints can be great, but doesn't necessarily scale to the size of services like Bing. Even if we hosted our telemetry endpoints on different segments of the MS network, and in different datacenters than our Bing services, they could still both be impacted by Microsoft core routing or peering problems.

So I would assert that our solution has to assume that both situations are likely and supported:
- Logging is completely separate from the service being monitored.
- Logging is cohosted with the service being monitored.

Given that, we have to use a store and forward method to reliably relay the signal during availability issues detected by end users. That leads us to the first two items.
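
To make the store and forward idea concrete, here is a minimal TypeScript sketch. The `ErrorBuffer` class, the `ErrorEntry` fields, and the injected `send` callback are all assumptions made for illustration; none of them come from the draft specification. It only shows that entries leave the buffer after a confirmed send, so the behavior is the same whether logging is separate from or cohosted with the service.

```typescript
// Hypothetical shape of one buffered error; the draft spec may define other fields.
interface ErrorEntry {
  url: string;
  errorType: string;
  timestamp: number;
}

// The transport is injected so the sketch assumes no particular network API.
// It returns true only when the telemetry endpoint accepted the batch.
type Send = (batch: readonly ErrorEntry[]) => boolean;

class ErrorBuffer {
  private entries: ErrorEntry[] = [];
  private readonly maxEntries: number;

  constructor(maxEntries: number = 100) {
    this.maxEntries = maxEntries;
  }

  // Store: keep the error locally, dropping the oldest entry when the buffer is full.
  record(entry: ErrorEntry): void {
    this.entries.push(entry);
    if (this.entries.length > this.maxEntries) {
      this.entries.shift();
    }
  }

  // Forward: try to relay everything; clear the buffer only after a confirmed send.
  flush(send: Send): boolean {
    if (this.entries.length === 0) {
      return true;
    }
    const delivered = send(this.entries.slice());
    if (delivered) {
      this.entries = [];
    }
    return delivered;
  }

  // Number of entries still waiting to be relayed.
  get pending(): number {
    return this.entries.length;
  }
}
```

A failed `flush` leaves the entries in place, so a later attempt (for example on the next page load) can still deliver them.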

For #1, pulling the error data from a local store on subsequent page loads, for active services with a lot of users, I envision this as a very useful and very timely source of error data. I refer to it as near real-time when I discuss it. The common example I cite is that if an error is repro'ing with any volume then it is likely to be seen by a number of users. Within that set of users, some of them would really like to use the service and when they see an error they are just going to hit F5 or click on the shortcut again and try to reload the page. That is a common pattern that has been ingrained into users since the inception of the modern browser; they just expect the internet to be a bit flaky and they retry. That set of users retrying will provide the details of the error they experience within seconds of it occurring.

As a real world example about retries: When Bing does have a service impacting issue we often see search requests per second increase, not decrease, as users get an error and retry, rather than getting a result and moving on to another website. Our alerts monitor for unexpected increases in requests to detect problems.
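
As a rough illustration of alerting on an unexpected increase in requests, the sketch below compares the latest requests-per-second sample against the mean of recent samples. This is not how Bing's alerts work; the function name `isRequestSpike` and the default `threshold` of 1.5 are assumptions chosen for the example.

```typescript
// Returns true when `current` exceeds the mean of `history` by more than
// the given factor (a threshold of 1.5 means 50% above the baseline).
// With no history there is no baseline, so nothing is flagged.
function isRequestSpike(
  history: readonly number[],
  current: number,
  threshold: number = 1.5
): boolean {
  if (history.length === 0) {
    return false;
  }
  const baseline = history.reduce((sum, value) => sum + value, 0) / history.length;
  return baseline > 0 && current > baseline * threshold;
}
```

A production alert would also need to account for normal daily and weekly traffic patterns, which this sketch ignores.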

#1 Cons:

1. If a user sees an error and doesn't try the page again, we don't get the data until some later date when it likely doesn't matter.

#1 Pros:

1. For all but the lowest volume sites, errors will be relayed within seconds.

2. Control over the volume of error reporting. Telemetry is only sent back when the script is rendered to the end user. This allows for service-side control of who sends telemetry back and when.
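
The service-side control described in the last point could look something like the following TypeScript sketch. Everything here is hypothetical: the `ReportingPolicy` fields, the `/error-report.js` path, and the idea of using a sample rate are assumptions for illustration, not part of the draft.

```typescript
// Hypothetical service-side policy; none of these names come from the draft spec.
interface ReportingPolicy {
  enabled: boolean;   // kill switch, e.g. turned off while the telemetry endpoint is overloaded
  sampleRate: number; // fraction of responses that carry the reporting script, from 0 to 1
}

// `roll` is a number in [0, 1), injected so the decision can be tested;
// a caller would typically pass Math.random().
function shouldRenderReporter(policy: ReportingPolicy, roll: number): boolean {
  return policy.enabled && roll < policy.sampleRate;
}

// Only responses that include the script will cause the browser to send
// buffered errors back, so the service decides who reports and when.
function renderPage(body: string, policy: ReportingPolicy, roll: number): string {
  const script = shouldRenderReporter(policy, roll)
    ? '<script src="/error-report.js"></script>'
    : "";
  return `<html><body>${body}${script}</body></html>`;
}
```

Lowering `sampleRate` or clearing `enabled` reduces incoming telemetry on the very next response, without needing to update anything already deployed to browsers.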

For #2, a preloaded script that can execute for a domain, I honestly think it's an interesting idea, and I discussed it with several groups within MS. The most common (and really the only) concern was the lack of control once the script is deployed: a poorly written script, or a good script combined with some unforeseen potential for abuse, could turn millions of browsers into a DDOS against a service. I'd like to see this idea carried forward, as it has some very interesting possibilities. I feel that, given the amount of risk involved, the design will take much longer to decide on than #1, and it would be better as a vNext feature of error logging. Also, one question to clarify the idea: what would the script do if it couldn't immediately contact the telemetry origin?

#2 Cons:

1. Risk of rogue script and DDOS.

2. Higher complexity to design transmission of script and safe operation.

3. Trying to relay telemetry to an origin that is currently having issues for that user.

#2 Pros:

1. Closer to real-time action.

2. Likely to have lots of flexibility.

3. Could enable whole new scenarios.

Looking at the Content Security Policy docs linked by Ilya, that approach is a hybrid of #1 and #2. If I read it correctly, it does not look reliable when the telemetry relay fails, and we could improve on that. To paraphrase it into the error logging context: it would basically formalize the error telemetry message transmission (hard coded into the browser) and only trigger sending the message when an appropriate HTTP header was present. This would have the benefit of being controlled by the origin and could be real-time if allowed to execute on-error (still a risk), but it lacks the flexibility of having the script designed by the service owner, which both #1 and #2 have. I wouldn't object to also having a standard hard coded method that was triggered either by an HTTP header or a JS call and just sent the entire error buffer to the provided URI. It would enable a basic scenario for operators that cannot invest more. It's even likely that a vNext would create a full enough hard coded design that it would work for most services. I just doubt we've considered everything for a new feature like this to justify not providing the flexibility of .js in this version.
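
The header-triggered, hard coded method could be sketched roughly as follows. The header name `error-log-report-uri`, the JSON payload shape, and the injected `post` callback are all invented for this example; neither CSP nor the draft defines them. Header names are assumed to be lower-cased already.

```typescript
// Hypothetical header name, chosen for this sketch only.
const REPORT_HEADER = "error-log-report-uri";

// Injected transport, so the sketch does not assume a specific network API.
type Post = (uri: string, body: string) => void;

// Simulates the hard coded browser behavior: if the response carries the
// header and there is something to report, serialize the whole error buffer
// and send it to the given URI. Returns true when a report was sent.
function reportIfRequested(
  responseHeaders: ReadonlyMap<string, string>,
  errorBuffer: readonly object[],
  post: Post
): boolean {
  const uri = responseHeaders.get(REPORT_HEADER);
  if (uri === undefined || errorBuffer.length === 0) {
    return false;
  }
  post(uri, JSON.stringify({ errors: errorBuffer }));
  return true;
}
```

As written, this is fire-and-forget: nothing here retains the buffer if `post` fails, which is the reliability gap noted above and the part a store and forward design would need to add.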

In summary, we need to expect that the telemetry origin may be the same as the service origin when designing the system. That means the system needs to always hold the data until it can reliably send it. Method #1 can be implemented safely and with very little difficulty in the design. I'd like to consider method #2 as part of a separate discussion that determines how preloaded scripts could be utilized safely and whether there are scenarios other than error logging where they could apply. I think once the implementation details are worked out, preloaded scripts could just be hooked up to extension points on browser events and could handle things that we haven't even considered yet. Creating a triggered hard coded method to return data would be a nice option for the basic operators, could allow easy integration with third party monitoring services, and could eventually become the de facto way to interact with error logging.

Thanks,

Aaron






From: Ilya Grigorik [mailto:igrigorik@google.com]
Sent: Sunday, March 31, 2013 5:32 PM
To: Reitbauer, Alois
Cc: Jatinder Mann; Austin,Daniel; public-web-perf@w3.org
Subject: Re: [ErrorLogging] Draft Specification


On Fri, Mar 29, 2013 at 5:55 AM, Reitbauer, Alois <Alois.Reitbauer@compuware.com> wrote:
In the current example this is exactly what happens. I don't see the value in saving this information and sending it later when the connection is up again. For an operational issue like this live data is crucial. Having this information after the fact is not valuable.

Perhaps that should read.. not *as* valuable? If my DNS server can't resolve your hostname, then there is no magic trick for somehow instantly reporting that to your server. I think, by definition, an error reporting mechanism would have to buffer a certain class of errors, and report them later. Incidentally, defining how much to buffer, and when to beacon this data is a whole separate discussion.

This reminded me of a RUM discussion we had internally a while ago. The main point was to allow loading JavaScript via HTTP headers and also define expiration. Think of it as a cookie, but with a JavaScript file. The script will then be executed after it is loaded. If it is set on a domain, it will always be executed.

Combining this approach with error logging would allow immediate alerting in cases where the actual document cannot be loaded. The script - with monitoring code - will be cached and executed even though the page cannot be loaded. The script extracts the error information and beacons it to a monitoring infrastructure (obviously not your servers, which are down).

I'm not sure I follow. Are you literally saying "Header: <javascript here>"? Because if so, that seems like a very expensive way to do this.

I think we should take a close look at CSP policy and error reporting mechanisms.. They've already shipped this, and there is a lot to be said for re-use of same mechanisms and concepts:
- http://www.w3.org/TR/CSP/#content-security-policy-report-only-header-field
- http://www.w3.org/TR/CSP/#sample-violation-report

I'd prefer an automated solution, like CSP approach above, instead of a manual implementation (as described in the spec currently).

ig
Received on Tuesday, 2 April 2013 21:03:52 UTC