Re: Simplifying metadata

On Thu, Oct 29, 2015 at 6:09 AM, Gérard Talbot <css21testsuite@gtalbot.org>
wrote:

> On 2015-10-27 03:31, Geoffrey Sneddon wrote:
>
>> Yo!
>>
>> Can we revisit all the metadata we have in the tests? The metadata we
>> *need* is what is sufficient to be able to run the tests, probably within
>> CI systems.
>>
>
> What about other needs, like searching for tests? Reviewing tests? Test
> coverage?
>

We had some discussion yesterday about what the priorities for the test
suite should be. Those present roughly agreed:

1) The priority should be to get browsers *regularly* running the tests
(because this disproportionately helps the goal of interoperability).
2) We should make it easier for browser vendors to submit their tests (they
have tens of thousands which haven't been submitted, far more than anyone
else has).
3) We should ensure the test suite can be used for the purpose of CR exit
requirements (note these have changed in the 2015 Process document, and
with a larger number of tests this will always require a fair amount of
analysis of the results).


> What is a "CI" system?
>

A continuous integration system: one where the tests are run automatically
on every push to the browser's repository. If we want to ensure browsers
implement the spec, and don't regress their implementations, we need to
ensure the tests are run regularly, which in practice means getting them
into each browser's CI system.


> Some existing web-aware software with good CSS competence does not support
> printing or SVG or some other features (scrollability, 32bit, XHTML,
> transitions, etc.). So those flags can be useful.


They *can* be. Metadata almost invariably ends up out of date if nobody is
actually making use of it, so there's no point in keeping metadata that
nobody uses.


>> I'm going to go by the assumption (which I think has shown itself to be
>> true on numerous occasions) that the more metadata we require in tests the
>> more hoops people have to jump through to release tests, which discourages
>> submitting tests.
>>
>
> I'd like to believe this discouragement isn't as bad as you believe...
>
> Eight months ago (Feb. 15th, 2015), I asked one major browser manufacturer
> why they did not submit tests, and I never got a response.


Discussing this at TPAC, *all* the browser vendors said this was *the*
reason they didn't submit more tests. It's too much work to add metadata,
and it's hard to justify the man-hours needed to change existing tests to
meet the requirements.


>> And this WG has a real problem with getting good,
>> high-quality test suites such that we're able to advance tests beyond CR.
>>
>
> I'm all for a discussion on metadata and documentation ... but what about
> bad or wrong tests?
> I wish
> a) incorrect tests,
> b) imprecise tests,
> c) tests that cannot fail (unreliable, non-trustworthy tests), and
> d) tests that do not check what they claim to be checking
> would be removed, rehabilitated, or otherwise dealt with, to start with.
> I've been asking for this for the last 4 years (June 28th, 2011, and even
> before that) and it still has not been dealt with. And I'm not talking
> about a few dozen tests here...


a) and b) are quite easily found if browsers are actually running them and
are able to contribute fixes back upstream (which is another problem we've
had for years). c) and d) are far harder to deal with, because
realistically when mistakes slip through review it's hard to find them. Is
it worth designing a system to find them? Probably not. If it adds
complexity to the common cases then it's not worth it—having them lying
around bogusly passing doesn't hurt much.

>> Disclaimer: this is all based on
>> <http://testthewebforward.org/docs/css-metadata.html>; I'm not totally
>> sure this actually reflects the status quo of the tests.
>>
>> a) "CSS 2.1 Reference" as a <title> for potentially hundreds of references
>> is utterly useless—I'd rather doing something more descriptive as to what
>> it is. Presto-testo has titles like:
>>
>>  * Reference rendering - this should be green (green text)
>>
>
> Not the shortest IMO but that is another issue.
>
> Isn't that redundant? Test title says: "this should be green (green text)"
> and the pass-fail-conditions sentence says the same thing.
>
> Where's the best location to describe the reference rendering? In the
> <title> text or in the filename?


Really the filename is far more important, IMO. I agree with fantasai that
titles of references really aren't very important.


>>  * Reference rendering - There should be no red below
>>
>
> "below" can safely be dropped IMO.
>
>>  * Reference rendering - pass if F in Filler Text is upper-case
>>
>
> Personally, I think about 30% to 40% of all existing tests could be
> re-engineered so that they use already available, very frequently reused
> reference files. When I create a test, I always try to do this myself.
> That way,
> a) I no longer have to think about creating a reference file,
> b) this reduces server load when "doing" a test suite with the test
> harness and
> c) this reduces the growth of N reference files to be referenced
>
> Examples given:
>
> ref-if-there-is-no-red  : referenced by 290 tests (2 changesets!)
>
> http://test.csswg.org/shepherd/search/reference/name/ref-if-there-is-no-red/
> http://test.csswg.org/source/css21/reference/ref-if-there-is-no-red.xht
>
> ref-this-text-should-be-green : referenced by 43 tests
> test.csswg.org/shepherd/reference/ref-this-text-should-be-green/
>
> http://test.csswg.org/source/css21/reference/ref-this-text-should-be-green.xht
>
> So, why re-create 2 reference files that already exist?


Because if there are 300 tests with "there should be no red below" and 200
with "there should be no red", it can be quite hard to remove the word
"below": removing it can cause later content to reflow and the test to
start failing. In the first instance, we should just try to automate all
tests *as is*, rather than modifying them, as this is less work. Longer
term we can try to reduce the number of references, but having more
references isn't really that much of a problem. (Server load isn't really
an issue: the number of requests on test.csswg.org isn't that great,
because browsers run local copies, and normally only load each reference
*once* for the whole test suite.)
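
If we do eventually want to consolidate, a first pass could be purely
mechanical. A rough sketch (assuming, plausibly but not certainly, that
many redundant references are byte-identical copies):

    import collections
    import hashlib
    import os

    # Group reference files by content hash; any group with more than
    # one member is a candidate for merging into one shared file.
    groups = collections.defaultdict(list)
    for root, _, files in os.walk("css21/reference"):
        for name in files:
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                digest = hashlib.sha1(f.read()).hexdigest()
            groups[digest].append(path)

    for paths in groups.values():
        if len(paths) > 1:
            print(paths)

References that differ only trivially (whitespace, metadata) would need
fuzzier matching, but exact duplicates are the easy win.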


>> This isn't perfect either, but it's more useful than "CSS 2.1 Reference",
>> IMO.
>>
>> b) We don't need author metadata on any new tests, because that metadata
>> is
>> stored in git/hg. (It's essentially been entirely redundant since we moved
>> away from SVN, as git/hg can store arbitrary authorship data regardless of
>> whether the author has source-tree access.)
>>
>
> There is an owner and an author field in Shepherd...


This just comes from the <link>, as far as I know. It wouldn't be hard to
modify Shepherd to also get it from hg.
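
For instance, a minimal sketch (hypothetical, not Shepherd's actual code)
of deriving authorship from the repository instead of <link rel="author">,
assuming an hg checkout:

    import subprocess

    def authors_from_hg(path, repo_root="."):
        # "{author}" is a standard hg log template keyword; this prints
        # one line per changeset touching the file, most recent first.
        out = subprocess.check_output(
            ["hg", "log", "--template", "{author}\n", path],
            cwd=repo_root, text=True,
        )
        seen = []
        for author in out.splitlines():
            if author and author not in seen:
                seen.append(author)  # de-duplicate, keep order
        return seen

The same information is available from git via `git log --format=%an`.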


>> c) We haven't actively been adding reviewer metadata for quite a while. I
>> suggest if we *really* want reviewer metadata (which I'm not at all sure
>> we
>> do—a single file may be reviewed by multiple people, especially in the
>> testharness.js case), we do it in the commit description (along the lines
>> of Signed-Off-By in the Linux repo). On the whole, I suggest we just go by
>> the assumption that anything in the repo has been reviewed (at the current
>> time outwith work-in-progress and vendor-imports),
>>
>
> I strongly disagree. Tests not identified as reviewed should not be
> considered reviewed.
>
> You can not presume or postulate that any/all test authors have the same
> CSS knowledge, depth of CSS experience. Same thing with test reviewers.


While true, the policies used in web-platform-tests have, as far as one can
tell, worked every bit as well as the CSS policies: by default, anyone can
review tests (on the trust that people will only review tests they are
competent to review), and those with write access are meant to quickly
check there's nothing crazy before merging. We've not had *any* problems
with this policy.

As pointed out, we don't currently require any metadata saying a test has
been reviewed. What would the benefit of requiring it be? I don't see what
explicit reviewer metadata buys us over simply relying on the invariant
that everything in the repository has been reviewed.
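
If we did want a lightweight record, the Linux-style convention mentioned
above is just a trailer line at the end of the commit message. The subject
and name here are made up for illustration:

    Add tests for margin collapsing through clearance

    Signed-off-by: A. Reviewer <reviewer@example.org>

A tool can then recover reviewers from `git log --format=%B` (or hg's
equivalent) with no per-file metadata at all.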


>> and don't bother storing
>> the metadata. It doesn't really matter—when do we need to know who
>> reviewed the test?
>>
>
> When a test involving a CSS module has been reviewed by one of its CSS
> spec editors, then such a test becomes more trustworthy. Review of a test
> by a competent and independent party is important.
>

But they *don't* review tests, and we aren't going to get them to (because
we haven't managed to for years). If we want to require spec editors to
review tests for their specs, we simply won't have a test suite, and that's
useless. It'd be better to have a larger test suite, because at the end of
the day we can cope with a few bad tests.


> There is such a thing as Quality control, Quality assurance,
> de-subjectivizing the evaluation of your own work. It's even a fundamental
> principle in justice. I would need more time to explain all this.


The experience in web-platform-tests (and this echoes my experience around
several browser vendors too) is that the real way you get good feedback on
a test is someone looking into why it fails: reviews (even from good
reviewers!) frequently miss things.


>> The current model can be misleading: when the test is changed
>> there's still a "reviewed" link, but that person hasn't necessarily
>> reviewed the edited test.
>>
>> d) Specification links I'm kinda unconvinced by, but I don't really care
>> enough to argue over. I know Shepherd uses them.
>>
>
> How can we know whether a test suite's coverage is good ... if
> specification links are not included in the tests?


This is being discussed in the other thread; I'll leave it there.


>> e) Requirement flags I feel we should really revisit. We want to have
>> enough flags to be able to run the tests, especially in CI. I'm not so
>> interested in running tests through Shepherd's UI, because I simply think
>> it's not valuable—it's almost never done, because it's pointlessly slow.
>> For browsers, we should aim at getting the tests run in CI systems
>> (probably with some way to upload results to Shepherd so we can have the
>> CR-exit-criteria views there), and minor UAs also likely can run the
>> tests in a more efficient way (as you want to determine pass/fail by
>> unique screenshot, not by looking at a thousand tests all of which say,
>> "This text should be green" identically).
>>
>> So:
>>
>> * ahem — we should simply state "the CSS test suite requires the Ahem
>> font to be available" and get rid of this flag
>> * animated — this is good because it has a real use (excluding tests from
>> automated testing)
>> * asis — this makes sense with the current build system
>> * combo — do we actually care? is anyone doing anything with this? In CI
>> systems you likely want to run all the files, combo and not.
>>
>
> Personally, I would drop the combo flag. If I recall correctly, this was
> introduced in relation to this CSS2.1 test:
>
> http://test.csswg.org/suites/css2.1/20110323/html4/dynamic-top-change-005.htm
>
>> * dom — sure, I suppose. Not very useful for actual browsers, to be fair,
>> so just extra overhead to release tests.
>> * font — we should simply state "the CSS test suite requires these
>> fonts to be installed" and get rid of this flag
>>
>
> I disagree. I am convinced more fonts should be added to the special
> fonts package for CSS testing: this is already certainly true for CSS
> Writing Modes, and probably for tests involving text with various
> dominant baselines (hanging, central, etc.).
> Not all web-aware software supports embedded fonts, and I believe we
> should not rely on embedded fonts anyway.


You're agreeing with me—that's what I meant. (That we just have a package
of fonts and we require it be installed for *all* tests instead of *some*
tests.)


>> * history — is there any UA that *doesn't* support session history? Yes, in
>> *theory* one could exist, but if we don't know of one, we shouldn't
>> optimise for it. The cost of metadata is too high (yes, even a flag!).
>> * HTMLonly — why do we have this rather than just using asis and HTML
>> source files?
>> * http — we should just move to using the same mechanism as
>> web-platform-tests for HTTP headers (rather than .htaccess), and then this
>> can statically be determined by whether test.html.headers exists for a
>> given test.html, leaving less metadata.
>> * image — like history, is this not just for a hypothetical UA?
>> * interact — sure.
>> * invalid — do we need an actual flag for this? I presume we want this
>> for lint tools, in which case we should probably have a better (generic)
>> way to silence bogus lint rules for a given test.
>> * may — on the whole reasonable
>> * namespace — is this not just for a hypothetical UA?
>> * nonHTML — can we not just use asis?
>>
>
> Yes. nonHTML can be merged with asis.
>
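
On the http flag above: to make the web-platform-tests mechanism concrete,
a test at foo/bar.html takes its custom HTTP headers from a sibling file
foo/bar.html.headers, so "needs HTTP" can be computed statically rather
than flagged. A trivial sketch (the helper name is mine, not any tool's):

    import os

    def needs_http(test_path):
        # wpt convention: per-file response headers live in a
        # "<name>.headers" file next to the test itself.
        return os.path.exists(test_path + ".headers")
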
>> * paged — sure
>> * scroll — sure
>> * should — same as may
>> * speech — um, I guess
>> * svg — do we still want to treat SVG as something not universal?
>> * userstyle — sure
>> * 32bit — is this not just for a hypothetical UA at this point?
>>
>
> Some browsers only support 16-bit counter and z-index values (65536);
> others support 32-bit counters. So the 32bit flag was introduced for that
> reason.
>
> http://test.csswg.org/suites/css2.1/latest/html4/chapter-12.html#s12.4
>
> http://test.csswg.org/suites/css2.1/latest/html4/chapter-9.html#s9.9.1



I'm not aware of any actively developed browser that uses anything but
32-bit values (and the web, for better or for worse, relies on this
behaviour, such that it seems likely that any future web browser will
follow this).


>> * 96dpi — is this not required by CSS 2.1 now, and hence redundant? (96px =
>> 1in per CSS 2.1)
>>
>> I feel like we shouldn't add metadata for hypothetical UAs that may or
>> may not exist in the future. It adds overhead to contributing to the test
>> suite, and we're likely to end up with the flags missing all over the
>> place. (We already end up with the flags needed for CI (animated,
>> interact, userstyle, paged, speech) missing all over the place, and those
>> are the only ones needed to run the tests in existing browsers, since
>> they support all the optional features we have flags for!)
>>
>> Also, I wonder if we should just merge "animated", "interact", and
>> "userstyle" into one? "userstyle" can probably be justified as CI tools
>> can
>> run those tests in an automated manner by having some metadata within the
>> tool to set the stylesheet (do we want to add a <link
>> rel="user-stylesheet"> so that we can have required user stylesheets in
>> testinfo.data?). "animated" only makes sense to split out from "interact"
>> if anyone is ever going to verify animated content automatically.
>>
>> For the sake of CI, we essentially have a few categories of tests:
>>
>>  * reftests
>>  * visual tests that can be verified by screenshots
>>  * testharness.js tests
>>  * any of the three above that need special setup in the CI tool (a
>> non-standard browser window size, a user stylesheet needing set,
>> potentially paged media though I'm not sure if anyone actually runs paged
>> media tests in CI tools?)
>>  * manual tests that cannot be run in CI
>>
>> Really that gives us seven types of tests, six of which can run in CI. The
>> first two above (so four of the six) can be distinguished by the presence
>> of link[@rel='match' or @rel='mismatch']. We could distinguish
>> testharness.js tests (two more of the six) by the presence of
>> script[@src='/resources/testharness.js']. This means the only things we
>> actually need to be able to filter out by explicit metadata are "needs
>> special CI setup" and "entirely manual". We probably want enough
>> granularity in the metadata such that people can trivially get a list of
>> tests that need special CI setup based upon what their CI supports (e.g.,
>> if it can set a user stylesheet but can't run paged media).
>>
>> We should discuss what we actually need as a result of this and get rid of
>> the rest.
>>
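
To make that classification concrete, here's a rough sketch of how a CI
tool could bucket tests by sniffing the markup (hypothetical code, not
anything Shepherd or web-platform-tests actually ships):

    import re

    def classify(path):
        # Bucket a test per the categories above, minus the "needs
        # special CI setup" and "manual" cases, which still need
        # explicit metadata.
        with open(path, encoding="utf-8") as f:
            source = f.read()
        if re.search(r'rel=["\'](?:mis)?match["\']', source):
            return "reftest"
        if "testharness.js" in source:
            return "testharness"
        return "visual"  # verify by screenshot
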
>> f) Test assertions… are we doing anything with this, beyond what we use
>> specification links for? If not, we should stop requiring them.
>>
>
> Test assertions are often difficult for test authors. But they are
> valuable for reviewers, for test understanding (when you revisit one of
> your own tests 6 months later, and it could be a complex test), and for
> test coverage. The need for test assertions would be reduced if test
> authors would at least avoid meaningless, unhelpful, unsemantic id,
> class, and function identifiers, and undifferentiated characters, in
> their tests.
>

Requiring actual assertions will practically guarantee we don't get tests
from browser vendors. Nobody is denying that, in an ideal world with an
ideal test suite, we'd have clearer, better-commented tests. To give a
somewhat well-known quote: "Give them the third best to go on with; the
second best comes too late, the best never comes". We're better off working
towards what we can actually achieve, rather than towards a hypothetical
better test suite that may well never happen.

/Geoffrey
