Re: Towards a better testsuite

From: Geoffrey Sneddon <me@gsnedders.com>
Date: Mon, 28 Mar 2016 01:40:39 +0100
Message-ID: <CAHKdfMhZU+Yw_KT6w-qX5Sq_-yktGGACEpBgW6z=tA_HSpFMqw@mail.gmail.com>
To: Florian Rivoal <florian@rivoal.net>
Cc: www-style list <www-style@w3.org>
On 25 Mar 2016 04:38, "Florian Rivoal" <florian@rivoal.net> wrote:
>
>
> > On Mar 25, 2016, at 02:00, Geoffrey Sneddon <me@gsnedders.com> wrote:
> >
> > The current status, as I understand it, is: test262 I believe people
> > are mostly running old versions of and contributing little to;
> > Microsoft is running weekly updated versions of csswg-test and Gecko
> > is running a several-year-old version with no realistic plan to update
> > it, nobody contributes that much (a tiny subset of Gecko stuff is
> > automatically synced, but the vast majority is not);
> > web-platform-tests is run by Microsoft semi-regularly, is run with
> > two-way syncing from Gecko and Servo, with plans by Blink and
> > Microsoft to get there AIUI, and with more in the way of contributions
> > than either of the other two repositories. WebKit just aren't running
> > anything, far as I'm aware. The only other group I'm aware of running
> > anything is Prince, running a small subset of an old version of
> > csswg-test.
>
> We run (a growing subset of) csswg-test at Vivliostyle as well.

With, I presume, similar constraints to Prince? So paged media only, no
script execution?

>
> > That said, I think it's worthwhile to reiterate that requiring *any*
> > metadata causes friction. Tests written by browser vendors are rarely
> > a file or two to which it is quick to add metadata. I know in general
> > people seem interested in using the same infrastructure to run both
> > web-platform-tests and csswg-test, which essentially requires the
> > metadata required to run the tests be identical across the two.
>
> I think you need to be very careful. Yes, removing all metadata
> lowers friction for test submission, but it increases it
> when you're on the receiving end.
>
> Presumably, when someone adds a test to a browser's private repo,
> even if there's no explicit metadata at all in the test, there
> is implicit knowledge about what this is about. Maybe in a bug
> tracker somewhere. Maybe based on the branch it's being added to.
> Maybe just because the person who added it is or knows the person
> who's supposed to make it pass.
>
> But this contextual information isn't passed along when sharing the
> test with other vendors. If some other vendor regularly syncs with a
> repo from someone else containing tests that have no metadata at all,
> they'll wake up to a test run with a few hundred failing tests,
> and no indication whatsoever what these tests are about. Depending on
> the test, finding out can be far from obvious. This is a great way
> to make sure that failing tests are ignored, or that we stop syncing.
>
> Not sure if that's better or worse, but if that lot includes
> incorrect tests that pass even though they really should not,
> you'll wind up integrating them into your regression test suite
> (hey, these tests used to pass, we need to keep them green)
> without even being aware of it.
>
> So I'm in favor of as little metadata as possible, but not of
> no metadata at all. As a consumer of tests, the assertion and
> the link to the related specs are very important.

So we don't have any of that directly in web-platform-tests (though we
effectively have links to spec encoded in the directory structure), and I'm
unaware of anyone having much issue with their absence.

Admittedly, what should be the two primary test types moving forward differ
in how hard this is: for testharness.js tests, the path and filename
combined with looking at which assertion is failing almost always suffice;
with reftests, my experience is that the only particularly useful thing is
knowing what part of the spec is being tested. Even then, failures happen
for a sufficiently diverse set of reasons that I'm never convinced
assertions help that much: they only provide much value in complex cases,
where one would hope for some sort of description regardless (wpt still
essentially has a requirement that tests must be understandable, and such
complex cases are normally dealt with via comments, as they typically are
in browsers' own repos). We do inevitably have more reftests than JS tests,
so in a sense we have the harder problem compared with wpt.

I think we should check with people who've been dealing with triaging
failures from wpt as to how much of a benefit it is.

> > The other notable difference is in tooling: Mercurial is used with a
> > git mirror, and then reviews are done split across Shepherd and
> > public-css-testsuite and some issues filed on GitHub, and with some
> > people expecting some pre-landing review through GitHub PRs and with
> > some people pushing directly to Mercurial… Really everything would be
> > simpler if we had *one* single way to do things. I'd much rather have
> > everything on GitHub, review happening on PR submission, and nits and
> > suchlike reported as GitHub issues. This keeps everything in one
> > place with tools most people are used to.
>
> One system is better than two, and my personal preference goes to GitHub
> as well, although I can deal with the alternative.
>
> One of the problems with the GitHub workflow is that it doesn't include
> an easy way to run/preview the tests. Sure, you can always check it out
> locally and run the tests there, but that's quite a bit of overhead
> compared to the previews offered by shepherd.
>
> If we move to a GitHub-based workflow (which I would support), I
> would want a replacement of some kind for that.

For wpt we have master on w3c-test.org, though that doesn't quite suffice
for running them all, given that some rely on specific host names
resolving. It's also not the easiest way to run reftests, but I'm not sure
what can be done to make that easier: determining pixel-by-pixel
equivalence by eye will always be hard (and really, quickly switching
between tabs/frames can easily end up with things being marked
incorrectly; the comparison only really makes sense to do
programmatically).

>
> > To outline what I'd like to see happen:
> >
> > - Get rid of the build system, replacing many of its old error checks
> > with a lint tool that tests for them.
>
> Fine by me.
>
> > - Policy changes to get rid of all metadata in the common case.
>
> Reduce, yes. Get rid of all, no.
>
> I don't care about the title. It's nice if it's well thought out, but
> it's not that important.
>
> Authors and reviewers (at least for my purposes) are adequately covered
> by git/hg.
>
> I want the assertion and the spec links (at the very least one of the
> two, but I really want both).
>
> For flags, as you explored in your earlier mail, we can get rid of most.
> But some (which aren't needed for most tests, but significant for some)
> should stay:
> - animated
> - interact
> - may
> - should
> - paged
> - scroll
> - userstyle

It's probably worthwhile to point out again that metadata that isn't used
almost invariably ends up wrong: may/should seems very much in that
category. animated/interact/userstyle are essentially interchangeable for
all browser automation systems (indeed, wpt handles all three with a single
solution: "_manual" as a suffix on the filename), so they are likely to end
up not being quite accurate (at least historically, "interact" was used
almost everywhere, from memory). Yes, in theory separating them out may
benefit someone, but in practice it's more metadata for people to get
wrong. scroll, I expect, is omitted from many tests that need it (because
there haven't been fixes coming from people testing with paged media only).
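A filename convention has the further advantage that a lint can cross-check it against any flags people do write. The following is an illustrative sketch only, assuming the "_manual" suffix described above; the helper names and the flag set are made up for the example, not real wpt tooling.

```python
# Sketch: lint check that flags metadata contradicting what the
# filename already encodes. "_manual" suffix per the convention
# described above; function names are illustrative assumptions.

MANUAL_SUFFIX = "_manual"  # one marker covering animated/interact/userstyle

def is_manual_test(filename: str) -> bool:
    """True when the test filename carries the manual-run marker."""
    stem = filename.rsplit(".", 1)[0]
    return stem.endswith(MANUAL_SUFFIX)

def lint_flags(filename: str, flags: set) -> list:
    """Report flags that imply a manual test when the filename doesn't."""
    errors = []
    manual_flags = flags & {"animated", "interact", "userstyle"}
    if manual_flags and not is_manual_test(filename):
        errors.append(f"{filename}: has {sorted(manual_flags)} but no "
                      f"{MANUAL_SUFFIX} suffix")
    return errors

print(lint_flags("selectors/hover_manual.html", {"interact"}))  # []
print(lint_flags("selectors/hover.html", {"interact"}))  # reports mismatch
```

Because the check is mechanical, the metadata can't silently drift out of date the way free-form flags historically have.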

/Geoffrey
Received on Monday, 28 March 2016 00:41:11 UTC
