Re: Mismatch between CSS and web-platform-tests semantics for reftests from Peter Linss on 2014-09-04 (public-test-infra@w3.org from July to September 2014)

From: Peter Linss <peter.linss@hp.com>
Date: Wed, 3 Sep 2014 17:12:18 -0700
To: Dirk Pranke <dpranke@chromium.org>
Cc: James Graham <james@hoppipolla.co.uk>, public-test-infra <public-test-infra@w3.org>
Message-Id: <B8FC3932-8186-45A5-B615-8E1DB2C4FB60@hp.com>
On Sep 3, 2014, at 1:11 PM, Dirk Pranke <dpranke@chromium.org> wrote:

> On Wed, Sep 3, 2014 at 11:10 AM, Peter Linss <peter.linss@hp.com> wrote:
> 
> On Sep 3, 2014, at 6:41 AM, James Graham <james@hoppipolla.co.uk> wrote:
> 
> > On 20/08/14 01:22, Peter Linss wrote:
> >
> >>> Are these features something that any actual implementation is
> >>> running? As far as I can tell from the documentation, Mozilla
> >>> reftests don't support this feature, and I guess from Dirk's
> >>> response that Blink/WebKit reftests don't either. That doesn't
> >>> cover all possible implementations of course.
> >>
> >> I'm actually in the middle of a big cleanup of our test harness and
> >> it will support this feature when I'm done (so far we haven't been
> >> able to represent the situation in our manifest files properly, I'm
> >> fixing that too).
> 
> This is now online. In our manifest files, we now list "reference groups" separated by semicolons; within each group, references are separated by commas. A test must match any one of the reference groups, and must match (or mismatch) all references within that group. So, for example, the entry for background-color-049 looks like:
> background-color-049    reference/background-color-049-020202-ref;reference/background-color-049-030303-ref
> 
> Peter, I never did see any answer from you to my questions earlier in the thread, and this reply uses the same potentially confusing wording. 
> 
> To confirm: when you say "must match all references in a group", you're really saying that the references themselves might also be tests, right? i.e., you can do pairwise testing and get coverage transitively, right? 

Those are two different things. Firstly, yes, tests can be references to other tests. When you do that, however, you need to be sure that both tests can't fail in the same way, otherwise you get false positives. We have examples where one test has "normal" references, and then that test is used as a reference for another test.

The grouping thing is another matter. First, a single test can have multiple references that must all be matched (or mismatched) in order to pass. This is used when it's possible for a reference to fail to render properly in a way that gives the same visual result as a failed test. We counter that by providing another reference that would fail in a different way, or by providing mismatch references that render the same way as a failed test or failed reference.

In order to describe the above, the test links to a single reference, then the reference links to another, and so on, forming a chain (or possibly a loop).

An example of this is 2d-rotate-001 which has a rel="match" link to 2d-rotate-ref, which in turn has a rel="mismatch" link to 2d-rotate-notref (which in turn has a rel="mismatch" link back to 2d-rotate-ref, forming the loop).

2d-rotate-001 must match 2d-rotate-ref and must not match 2d-rotate-notref.
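
A rough sketch of how a tool might follow such a chain without getting stuck in the loop. This is an illustration only, not the actual build code: the `LINKS` table and the `collect_references` name are made up for this example, and real tooling would read the links out of the parsed files.

```python
# Hypothetical link table: file name -> list of (rel, target) reference links.
LINKS = {
    "2d-rotate-001": [("match", "2d-rotate-ref")],
    "2d-rotate-ref": [("mismatch", "2d-rotate-notref")],
    "2d-rotate-notref": [("mismatch", "2d-rotate-ref")],
}


def collect_references(test, links):
    """Follow reference links starting at `test`, visiting each file once.

    Returns (rel, target) pairs in the order they are found. The `seen`
    set keeps a loop in the chain from being followed forever.
    """
    found = []
    seen = {test}
    pending = [test]
    while pending:
        current = pending.pop(0)
        for rel, target in links.get(current, []):
            if target in seen:
                continue  # already recorded, e.g. the link that closes a loop
            seen.add(target)
            found.append((rel, target))
            pending.append(target)
    return found


print(collect_references("2d-rotate-001", LINKS))
```

For the table above this yields the match link to 2d-rotate-ref followed by the mismatch link to 2d-rotate-notref; the link from 2d-rotate-notref back to 2d-rotate-ref is skipped because that file has already been seen.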

Tests can also have multiple "reference groups". In this case the groups provide alternate renderings which, while different, are each considered a pass. So the test can match either reference group and still pass (it just has to match at least one of them, but if there are multiple references within the group, it must match all of those).

An example of this is background-color-049 (sorry, I incorrectly listed color-049 earlier), which uses percentage-based colors. The color in the test can round to either #020202 or #030303 depending on the implementation, and the exact rounding behavior isn't specified. So the test has rel="match" links to _both_ background-color-049-020202-ref and background-color-049-030303-ref.

> 
> I can't think of a reason that in order to see if color-049 rendered correctly you would need to check -020202 *and* 030303 against 049, as long as you compared 020202 against 030303?

In this case 020202 and 030303 do NOT match, and should not. You compare the test to both and as long as one of them matches the test, the test passes.
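
The "any group, all references within the group" rule can be sketched roughly as follows. This is a minimal illustration, not the harness code: `parse_entry`, `passes`, and the `satisfied` callback are hypothetical names, and the entry format is assumed to be exactly the whitespace, semicolon, and comma layout quoted earlier in this message.

```python
ENTRY = ("background-color-049    "
         "reference/background-color-049-020202-ref;"
         "reference/background-color-049-030303-ref")


def parse_entry(line):
    """Split a manifest entry into (test name, list of reference groups).

    Groups are separated by ';' and references within a group by ','.
    """
    name, refs = line.split(None, 1)
    groups = []
    for group in refs.split(";"):
        members = [ref.strip() for ref in group.split(",") if ref.strip()]
        if members:
            groups.append(members)
    return name, groups


def passes(groups, satisfied):
    """Return True if every reference in at least one group is satisfied.

    `satisfied(ref)` is a caller-supplied function (an assumption of this
    sketch) reporting whether the comparison against `ref` came out as
    required, i.e. a match for a match reference or a difference for a
    mismatch reference.
    """
    return any(all(satisfied(ref) for ref in group) for group in groups)


name, groups = parse_entry(ENTRY)
# Pretend the implementation rounds the color to #030303.
print(name, passes(groups, lambda ref: ref.endswith("030303-ref")))
```

Here the test fails against the 020202 reference but still passes overall, because the 030303 group is fully satisfied.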



> 
> Does that make sense?
> 
> 
> >
> > So I was looking at adding this to web-platform-tests and the current
> > design adds some non-trivial complexity. As background,
> > web-platform-tests uses a script to auto-generate a manifest file with
> > the test files themselves being the only required input. This is
> > rather slow, since it involves actually parsing the *ML files and
> > inspecting their DOM. Therefore it is important to be able to perform
> > incremental updates.
> 
> FWIW, we have a build step that scans the entire repository looking for tests, references, and support files, parses all the *ML files, generates manifests, human readable indices, then generates built test suites by re-serializing all *ML input files into HTML, XHTML, and XHTML-Print output files (where applicable). It also adjusts relative paths for reference links so that they remain correct in the built suites. The process currently takes about 6 minutes consuming over 21000 input files and generating 24 test suites. It has not been optimized for speed in any way at this point. Given that it runs daily on a build server, the burden is completely manageable.
> 
> Somewhat off-topic, but what is this system you describe in the "we have a build step"? It sounds like this isn't shepherd, but something else?

Correct. Shepherd is the repository manager: it provides a view of the repository with searching by metadata, and it tracks review status and issues for the files in the repository. Shepherd also manages the bi-directional syncing between our Mercurial repository and GitHub.

There is a separate build process which runs on the csswg.org server every night and produces the built test suites, which can be found at:
http://test.csswg.org/suites/

The build code is a Python script that scans the test repository, parses the tests to find spec links, assigns the tests to the appropriate suites based on those links, generates the built test suites (converting each test to both HTML and XHTML output), and generates machine-readable manifest files containing the test metadata as well as human-readable index pages for each suite.

Both Shepherd and the build code use w3ctestlib to parse the tests and extract metadata from HTML, XHTML, XML and SVG source files. w3ctestlib also has code to perform format conversions between HTML and XHTML (and XHTML-Print).

> Do you have something running tests against (some set of) browsers?

Yes, we also have our test harness at:
http://test.csswg.org/harness/

The harness imports the built test suites each night (using the manifest files generated by the build) and allows users to run tests in browsers (and also supports gathering result data from non-browser clients). The test harness tracks results to specific versions of tests so the result data is automatically updated when tests change. It also generates implementation reports suitable for advancing specs on the REC track. The harness also provides the specification annotations that we have on our editor's drafts (and soon on /TR as well) which show testing status inline. Those test annotations are gathered live from the harness.

Since the test harness is primarily geared toward running manual tests (though it can automatically record results from testharness.js tests), it also presents the tests in a "most needed" order, prioritizing tests where getting results will help us get a spec out of CR. The test harness currently has over 324,000 test results in its DB.

Peter

> 
> -- Dirk
>  
> >
> > Currently it is always possible to examine a single file and determine
> > what type of thing it represents (script test, reftest, manual test,
> > helper file, etc.). For example reftests are identified as files with
> > a <link rel=[mis]match> element. Since (unlike in CSS) tests in
> > general are not required to contain any extra metadata, allowing
> > references to link to other references introduces a problem because
> > determining whether a file is a reference or a test now requires
> > examining the entire chain, not just one file.
> 
> I don't understand why you have to parse the entire chain to determine if a single file is a test or a reference. If a file has a single reference link, then it's a reftest, regardless of how many other references there may be. You do, of course, have to parse the entire chain to get the list of all references for the manifest, but really, that's not adding a lot of files to be parsed: many tests reuse references, and we use a cache so each file is only parsed once.
> 
> For that matter, at least in CSS land, we don't differentiate between tests and references based on the presence of {mis}match links; those only indicate that a test is a reftest. The distinction between tests and references is made solely by file and directory naming convention. References are either in a "reference" directory or have a filename that matches any of: "*-ref*", "^ref-*", "*-notref*", "^notref-*". Furthermore, it's perfectly valid to use a _test_ (or a support file, like a PNG or SVG file) as a reference for another test; we have several instances of this in the CSS repo.
> 
> >
> > Obviously this isn't impossible to implement. It's just more
> > complicated than anything else in the manifest generation, all in
> > order to support a rarely-used feature. Are the benefits of the
> > approach where the data is distributed across many files really great
> > enough, compared to an alternate design where we put all the data
> > about the references in the test itself, to justify the extra
> > implementation burden? As far as I can tell the main benefit is that
> > if two tests share the same reference they get the same full chain of
> > references automatically rather than having to copy between files.
> 
> Which is valuable in itself; anything that removes a metadata burden from test authors is a win. It also allows for describing complex reference dependencies as well as allowing alternate references. Yes, those can be handled by alternate approaches, but those add complexity (and an opportunity to make mistakes) for the author, as opposed to the build tools.
> 
> Also, let me point out again that the bulk of the code we use for our build process is in a factored-out Python library[1]. It could use some cleanup, but it contains code that parses all the acceptable input files and extracts (and in some cases allows manipulation of) the metadata. Shepherd also uses this library to do its metadata extraction and validation checks. If we can converge on using this library (or something that grows out of it) then we don't have to rewrite code managing metadata... I'm happy to put in some effort cleaning up and refactoring this library to make it more useful to you.
> 
> Peter
> 
> [1] http://hg.csswg.org/dev/w3ctestlib/
Received on Thursday, 4 September 2014 00:12:46 UTC