Re: Towards a better testsuite

> On Mar 28, 2016, at 09:40, Geoffrey Sneddon <me@gsnedders.com> wrote:
> 
> 
> On 25 Mar 2016 04:38, "Florian Rivoal" <florian@rivoal.net> wrote:
> >
> >
> > > On Mar 25, 2016, at 02:00, Geoffrey Sneddon <me@gsnedders.com> wrote:
> > >
> > > The current status, as I understand it, is: test262 I believe people
> > > are mostly running old versions of and contributing little to;
> > > Microsoft is running weekly updated versions of csswg-test and Gecko
> > > is running a several-year-old version with no realistic plan to update
> > > it, nobody contributes that much (a tiny subset of Gecko stuff is
> > > automatically synced, but the vast majority is not);
> > > web-platform-tests is run by Microsoft semi-regularly, is run with
> > > two-way syncing from Gecko and Servo, with plans by Blink and
> > > Microsoft to get there AIUI, and with more in the way of contributions
> > > than either of the other two repositories. WebKit just aren't running
> > > anything, far as I'm aware. The only other group I'm aware of running
> > > anything is Prince, running a small subset of an old version of
> > > csswg-test.
> >
> > We run (a growing subset of) csswg-test at Vivliostyle as well.
> 
> With, I presume, similar constraints to Prince? So paged media only, no script execution?
> 
For now, yes ("paged media only" means we run the tests in paged media, not that we run only the tests that are appropriate for paged media). In the longer run, we're looking at removing both constraints.
> Admittedly, what should be the two primary test types moving forward result in different levels of difficulty here: for tests from testharness.js the path and filename combined with looking at which assertion is failing almost always suffices; with reftests my experience is that the only particularly useful thing is knowing what part of the spec is being tested, though even then failures happen for a sufficiently diverse set of reasons that I'm never convinced assertions help that much; they only provide much value in complex cases, where one would hope for some sort of description being given regardless (wpt still essentially has a requirement that tests must be understandable, and such complex cases are normally dealt with via comments, as they typically are in browsers' repos). We do inevitably have more reftests than JS tests, so in a sense we have the harder problem, compared with wpt.
> 
> I think we should check with people who've been dealing with triaging failures from wpt as to how much of a benefit it is.
> 
Right. JS tests make the assertion explicit in code, so even if you don't have it in prose, you still have it.
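To illustrate what I mean (a made-up sketch, not a test from the suite), a testharness.js test carries its pass condition in the code and the test name:

  <!DOCTYPE html>
  <title>display: flex is supported</title>
  <script src="/resources/testharness.js"></script>
  <script src="/resources/testharnessreport.js"></script>
  <div id="target" style="display: flex"></div>
  <script>
  test(function() {
    var target = document.getElementById("target");
    // The assert itself states what is being checked.
    assert_equals(getComputedStyle(target).display, "flex",
                  "display should compute to 'flex'");
  }, "display: flex is supported");
  </script>

When such a test fails, the failing assert and its message already give you most of what a separate assertion field would.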

For reftests (or "can't-quite-make-it-a-reftest" visual tests), I think the assertion delivers part of its value when you're trying to make a UA pass the test, but it also helps a lot during review.

As for the path of the file carrying information about the spec it is testing: that's sort of true, but it doesn't indicate which part of the spec (which is useful to know), and it doesn't help when a test covers multiple specs at once.

This may have only marginal value when you're trying to fix a UA to make it pass a failing test (though not no value: narrowing things down to the right part of the spec helps), but it is pretty important when evaluating whether we have good test coverage for a spec, especially if the tests come from multiple uncoordinated sources. Both for REC-track purposes and for general interop purposes, it is useful to know whether something is well tested or whether there are large gaps.
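Concretely, the per-test metadata I have in mind looks something like this (an illustrative reftest header with a made-up reference file and spec link, using the conventions csswg-test already has):

  <!DOCTYPE html>
  <title>CSS Test: border box of a block following a left float</title>
  <!-- Illustrative spec link; a real test would point at the actual section. -->
  <link rel="help" href="https://drafts.csswg.org/css-example-1/#floats">
  <link rel="match" href="float-border-box-ref.html">
  <meta name="assert" content="The border box of a block box following a left
    float extends under the float; only its line boxes are shortened.">

The rel="help" links say which part of which spec is being tested (there can be several, one per section or spec), and the assert states the intended pass condition, which is what both the reviewer and any coverage analysis need.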
> > > The other notable difference is in tooling: Mercurial is used with a
> > > git mirror, and then reviews are done split across Shepherd and
> > > public-css-testsuite and some issues filed on GitHub, and with some
> > > people expecting some pre-landing review through GitHub PRs and with
> > > some people pushing directly to Mercurial… Really everything would be
> > > simpler if we had *one* single way to do things. I'd much rather have
> > > everything on GitHub, review happening on PR submission, and nits and
> > > such like be reported as GitHub issues. This keeps everything in one
> > > place with tools most people are used to.
> >
> > One system is better than two, and my personal preference goes to GitHub
> > as well, although I can deal with the alternative.
> >
> > One of the problem with the GitHub workflow is that it doesn't include
> > an easy way to run/preview the tests. Sure, you can always check it out
> > locally and run the tests there, but that's quite a bit of overhead
> > compared to the previews offered by shepherd.
> >
> > If we move to a github based workflow (which I would support), I
> > would want a replacement of some kind to that.
> 
> For wpt we have master on w3c-test.org, though that doesn't quite suffice for running them all given some rely on specific host names resolving. It's also not the easiest way to run reftests, but I'm not sure what can be done to make that easier: determining pixel-by-pixel equivalence by eye will always be hard (and really, quickly changing between tabs/frames can easily end up with things being incorrectly marked; it only really makes sense to do programmatically).
> 
Having it on the master branch means you can't use it for reviews.

Judging pixel equivalence by eye is indeed hard, but I don't think that's the main point. I'd say the sanity check matters more. If you review a test and it seems to make sense to you, but it's a complicated one, you want to run it to make sure it does what you think it does. Having a low-friction way to do that at review time makes reviewing easier and more reliable.

This is not unique to tests, and plenty of workflows get by without it, but we currently have it in Shepherd, so it would be nice to preserve it.

> > > To outline what I'd like to see happen:
> > >
> > > - Get rid of the build system, replacing many of its old errors with
> > > a lint tool that tests for them.
> >
> > Fine by me.
> >
> > > - Policy changes to get rid of all metadata in the common case.
> >
> > Reduce, yes. Get rid of all, no.
> >
> > I don't care about the title. It's nice if it's well thought out, but it's not that important.
> >
> > Authors and reviewers (at least for my purposes) are adequately covered by git/hg.
> >
> > I want the assertion and the spec links (at the very least one of the two, but I really want both).
> >
> > For flags, as you explored in your earlier mail, we can get rid of most. But some (which aren't needed for most tests, but significant for some) should stay:
> > - animated
> > - interact
> > - may
> > - should
> > - paged
> > - scroll
> > - userstyle
> 
> It's probably worthwhile to point out again that metadata that isn't used almost invariably ends up wrong: may/should seems like it's very much in that category.
> 
Should/may seems trivial to support: just sort the results of a test run into must/should/may categories. That seems like a useful distinction, so I'd rather keep it. But if nobody plans to support it, I guess I won't miss it that much. Regardless of the must/should/may requirement level, if a UA vendor decides they don't care about passing a test, they can mark that test as to-be-ignored in some separate, vendor-specific metadata store.
> animated/interact/userstyle are essentially interchangeable for all browser automation systems (indeed, wpt handles all three with a single solution: "_manual" as a suffix on the filename), so are likely to end up not being quite accurate (at least historically, "interact" was used almost everywhere, from memory). Yes, in theory separating them out may benefit someone, but in practice it's more metadata for people to get wrong.
> 
I'm OK with merging these three. Regardless of whether a test is manual because the setup is manual (userstyle), the execution is manual (interact), or the evaluation is manual (animated), it's manual anyway. Manual vs. not is an important distinction; the sub-category less so.
> scroll I expect is omitted from many tests that need it (because there haven't been fixes coming from people testing with paged media only).
> 

Right. I would expect this to remain a valid flag, most test authors to forget about it even when it's appropriate, and people like us or Prince to submit fixes after the fact when we run into tests that should be marked but aren't.

 - Florian

Received on Monday, 28 March 2016 03:48:19 UTC