Re: Proposing changes to Navigation Error Logging from ttuttle on 2014-07-25 (public-web-perf@w3.org from July 2014)

From: ttuttle <ttuttle@chromium.org>
Date: Fri, 25 Jul 2014 14:21:05 -0400
To: "Aaron Heady (BING AVAILABILITY)" <aheady@microsoft.com>
Cc: "public-web-perf@w3.org" <public-web-perf@w3.org>
Message-ID: <CADyrwZSoROG1eN9knfcw=i_YKVW7GmC8XCBLdGB9GiXpPMjO4A@mail.gmail.com>

On Thu, Jul 24, 2014 at 6:24 PM, Aaron Heady (BING AVAILABILITY) <
aheady@microsoft.com> wrote:

>  Aaron: When a page is failing long enough to get multiple failures for
> the same user/error, it’s probably the exact same error message. I’d be
> more inclined to allow your “delay to collect more reports”, but then dedup
> identical errors within that window and send a count of how many times it
> occurred. But I really feel like 1 sample of the error is likely enough.
>
>
>
> I’m also not an advocate for the automatic telemetry send, but not
> outright against it. I’m more interested in the js access on the next page
> request, the refresh. Then I can do what you suggested, but have total
> control over it.
>

We'd like to be able to detect problems in near real-time, and if a site is
blackholed for some reason, the next page request may not happen until the
issue has been fixed. That's why we're hoping for both options.
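
For reference, the "dedup identical errors within that window and send a count" idea quoted above could look roughly like the sketch below. It is only an illustration: the ErrorReport shape, its field names, and the error strings are all hypothetical and come from no draft of the spec.

```python
from collections import Counter
from typing import Dict, List, NamedTuple


class ErrorReport(NamedTuple):
    # Hypothetical report shape; the field names are illustrative only.
    url: str
    error: str


def coalesce(reports: List[ErrorReport]) -> List[Dict[str, object]]:
    """Collapse identical reports from one window into one entry with a count."""
    counts = Counter(reports)  # keeps first-seen order of distinct reports
    return [
        {"url": report.url, "error": report.error, "count": n}
        for report, n in counts.items()
    ]


# Three failures collected during one delay window (made-up error names).
window = [
    ErrorReport("https://example.com/", "dns.name_not_resolved"),
    ErrorReport("https://example.com/", "dns.name_not_resolved"),
    ErrorReport("https://example.com/", "tcp.timed_out"),
]
batch = coalesce(window)
```

Here the upload would carry two entries instead of three reports, with the repeated DNS failure counted twice.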

> 3. I'd like to allow the user-agent to retry the uploads if they fail. If
> the issue is a transient network issue (i.e. a route is flapping), it's a
> waste to throw out the error report just because the network was still
> glitched the first time the upload was attempted.
>
>
>
> Aaron: This reads like a denial of service attack. We did discuss it
> originally, but how do you control the retries when an origin has a short
> lived but widespread spike in errors, especially when the origin for the
> error is also the origin/logging endpoint for these navigation error calls.
> A few seconds after it recovers it gets hit with a global surge in
> telemetry requests, knocking it offline, more errors… Also goes back to
> #1, any error that is stable enough to repro is going to be reported by a
> large number of users. I expect this system to be lossy telemetry wise.
> Optimized to protect the origin, not the error telemetry. And if you wait
> for the next successful page load, then you can get the errors from the
> queue.
>

Hmm, I see your point. I'll see if we can do without retries, or postpone
them until the next time we would've made a new upload anyway.
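
A minimal sketch of the second option, assuming a queue that never schedules a retry of its own: a failed upload just leaves its reports queued, and they ride along with whatever upload the next error would have caused anyway. ReportQueue, send, and max_pending are hypothetical names for illustration, not anything from the spec or from Chromium.

```python
from typing import Callable, List


class ReportQueue:
    """Queue with no retry timer; failed reports wait for the next upload."""

    def __init__(self, send: Callable[[List[str]], bool], max_pending: int = 10):
        self._send = send  # hypothetical transport; returns True on success
        self._max_pending = max_pending
        self.pending: List[str] = []

    def record(self, report: str) -> None:
        self.pending.append(report)
        # Lossy by design: keep only the newest reports, never grow unbounded.
        self.pending = self.pending[-self._max_pending:]
        # One upload attempt per new error; nothing is scheduled on failure.
        if self._send(list(self.pending)):
            self.pending.clear()


# Simulated transport: the first upload fails, the second succeeds.
attempts: List[List[str]] = []
outcomes = iter([False, True])


def fake_send(batch: List[str]) -> bool:
    attempts.append(batch)
    return next(outcomes)


queue = ReportQueue(fake_send)
queue.record("error A")  # upload fails, report stays queued
queue.record("error B")  # next upload carries both reports
```

With this shape the number of upload attempts never exceeds the number of errors recorded, so a recovering origin is not hit by a separate wave of retries.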


>  4. I'd like to figure out a way to support logging errors involving
> requests that were not top-level navigations. There are plenty of other
> things that can fail to load that the site owner might not necessarily have
> control over. (For example, Facebook might want to know when parts of their
> social plugin fail to load, even if they are not hosted on a site where
> Facebook can add an error handler.)
>
>
>
> Aaron: I completely agree, and this was one of my largest goals with this
> spec. But it basically became a CORS problem and we agreed that it likely
> wasn’t going to be solved in this first round. So as not to delay getting
> top-level errors, getting a foothold on the problem, we went ahead without
> CORS level errors. I really hope we can change that in the long run. It is
> very compelling. I’d love to discuss that and sort it out.
>

What was the issue? Accessing them with JavaScript? I'd like to arrange
things so that only the origin that the subresource is being requested from
can request monitoring/logging. If Facebook has a script embedded in my
blog, I can't see anything extra about Facebook besides what the platform
already gives me, and they can't see anything extra about me, just the
result of the fetch against their own servers.
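
Roughly, the rule I have in mind is a same-origin check between whoever asked for monitoring and the URL of the failed request. The sketch below is only an illustration under that assumption; the function names and the example hosts are made up.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit

Origin = Tuple[str, Optional[str], Optional[int]]


def origin_of(url: str) -> Origin:
    """Return the (scheme, host, port) origin of a URL, filling default ports."""
    parts = urlsplit(url)
    default_port = {"http": 80, "https": 443}.get(parts.scheme)
    return (parts.scheme, parts.hostname, parts.port or default_port)


def may_receive_report(monitoring_origin: str, failed_request_url: str) -> bool:
    """Only the origin the subresource was requested from gets the report."""
    return origin_of(monitoring_origin) == origin_of(failed_request_url)


# A plugin from social.example embedded on blog.example fails to load.
failed = "https://social.example/plugin.js"
plugin_owner_sees_it = may_receive_report("https://social.example", failed)
embedding_page_sees_it = may_receive_report("https://blog.example", failed)
```

The plugin's own origin passes the check and the embedding page does not, which matches the "nothing extra in either direction" goal.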

Received on Friday, 25 July 2014 18:22:12 UTC