RE: Conversion of MS CSS 2.1 tests to reftests

> -----Original Message-----
> From: "Gérard Talbot" [mailto:css21testsuite@gtalbot.org]
> Sent: Tuesday, September 21, 2010 1:33 PM
> To: public-css-testsuite@w3.org
> Cc: John Jansen
> Subject: RE: Conversion of MS CSS 2.1 tests to reftests
> 
> Hello John,
> 
> > Please note, that I ran the entire suite for the first time last
> summer
> 
> Last summer? You mean this summer 2010.. or summer 2009? There is a
> difference of several thousands of testcases if we're talking of summer
> 2009 versus summer 2010 here.
>
Summer 2010 using an IE9 build.
 
> > and it took me three days of interrupted time (NOT non-interrupted time).
> 
> 3 days to run how many testcases? How many seconds (avg) per testcase?
> 

Well, ~9,500 — the tests that are there. That works out to just under 8 seconds per test overall if we say I worked 7 hours a day (which seems about right: Thursday, Friday, Saturday).
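For what it's worth, the per-test figure above can be checked directly; this is just back-of-the-envelope arithmetic using the numbers from this thread (3 days, 7 hours/day, ~9,500 tests):

```python
# Back-of-the-envelope check of the per-test time claimed above.
# The day count, hours per day, and test count are the figures from
# this thread, not measured values.
days = 3
hours_per_day = 7
tests = 9500

total_seconds = days * hours_per_day * 3600
per_test = total_seconds / tests
print(round(per_test, 2))  # just under 8 seconds per test
```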

> > I just now ran 20 tests from the HTML suite and it took me 24 seconds.
> 
> I have huge difficulties understanding how you can run 20 testcases manually
> in 24 seconds.
> 

I'm not responding to this comment, as I am not lying.

> In order to run, say, 20 testcases, you need to do at minimum 39 (2n -1)
> mouseclicks. The testcases are not arranged, not coded with <link
> rel="next">. So you have to click the back button to get back to the list of
> testcases to click the link of next testcase. And you have to read the pass/fail
> conditions too of each testcase.

I have no idea why you would need to browse back to the table of contents when you can easily download the zipped files and run them locally, or build a simple script harness to help you. For example, a script could iterate over every file in the unzipped folder and generate a single .htm index page linking to each test. I personally just had two monitors up and loaded the tests from the folder on one monitor into the browser on the other.
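A minimal sketch of such a generated index follows. The folder name and the placeholder test files are assumptions for illustration only, not the actual suite layout:

```python
import os

# Hypothetical sketch: generate a single index.htm that links every test
# page in an unzipped suite folder, so a tester can click through tests
# without navigating back to the table of contents each time.
# "css21-suite" and the two demo files below are assumptions for
# illustration, not part of the real test suite.
suite_dir = "css21-suite"
os.makedirs(suite_dir, exist_ok=True)
for demo in ("margin-001.htm", "padding-001.htm"):  # stand-in test files
    open(os.path.join(suite_dir, demo), "w").close()

links = [
    '<a href="%s/%s">%s</a><br>' % (suite_dir, name, name)
    for name in sorted(os.listdir(suite_dir))
    if name.endswith((".htm", ".html"))
]
with open("index.htm", "w") as out:
    out.write("<!DOCTYPE html>\n<title>CSS 2.1 test index</title>\n")
    out.write("\n".join(links) + "\n")

print(len(links))  # number of tests linked
```

Loading the generated index.htm in the browser under test then gives one click per test, with no back-navigation.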

> 
> No testcase is http-prefetchable (coded with <link rel="prefetch">).
> 
> There are many testcases which require to compare with a reference test.
> So, at least, 2 extra clicks.

None of the 20 I ran needed me to do any additional clicks.

> 
> > I am not saying this is typical, necessarily, and when you hit a
> failure
> > it certainly adds time, but I think that looking at an 11 second
> average
> > seems very high in practice.
> > -John
> 
> 
> I took the test harness in January 2010 (511 tests) and I mentioned this in
> http://lists.w3.org/Archives/Public/public-css-testsuite/2010Jan/0043.html
> and my results (I was using Konqueror 4.x) are still available, accessible,
> viewable here:
> http://www.w3.org/2008/07/test-harness-css/results.php?s=htm4&o=0
> and it took me 4 hours to run the 511 tests.
>
That is very surprising. I suspect you were evaluating each test for accuracy as you ran it, rather than simply logging a pass/fail. You are saying it took 28 seconds per test on average (4 hours for 511 tests); I have huge difficulties understanding that number.
 
> 
> There are other testcases situations which will slow down testers
> 
> - a bunch of testcases require to download and install a custom font and then
> to uninstall it

Yep, did it.

> - a bunch of testcases require to download and install an user style sheet
> - a bunch of testcases require to read more than 1 sentence
> - a bunch of testcases have small or very small lines, squares as pass/fail
> conditions

Yes, they do.

> - a bunch of testcases have awkward wording of pass/fail conditions or
> inappropriate shape description of expected result (causing confusion,
> hesitation)
> - a good bunch of testcases require to compare width or height of 2 squares.
> If quality (over speed) of testing is more important, if testers have more than
> a "It's good enough" sense of quality/QA policy, then they may report a few
> more FAILED tests after stopping+spending a few more seconds. E.g.
> http://test.csswg.org/suites/css2.1/20100917/html4/html-attribute-019.htm
> "there is no space between the green and blue boxes" is not the same as the
> green square partially *overlapping* the blue square.

Yep, I actually had a tri-state approach: Pass/Fail/???. I went through the whole suite; after I was done, I went back to the harder-to-evaluate ones, took my time with them, and met with Arron to discuss any final questions.
> 
> 
> ==============
> 
> "I do not think it is worth it to try rushing to REC while the test suite is in the
> state it is in."
> Anne van Kesteren
> 
> I very much agree with Anne van Kesteren's opinion here.
> 

I'm not a fan of rushing to REC either. I have no idea why it's September 21st and it seems like very few people have been running the tests that have been up there for months, if not years. We have had a plan in place: we discussed it in January and at the spring F2F, and then got concrete agreement in Oslo. October 15th was the agreed-upon date.

> -------
> 
> There are wrong testcases in the test suite; not many... hopefully. E.g:
> http://test.csswg.org/suites/css2.1/20100917/html4/position-relative-nested-001.htm
> 

Yep, those issues will be revealed as people continue to review the suite, and they should be raised as issues. Like any lock-down process, you evaluate the incoming feedback as it comes; locking down means reducing churn. I think we all want to lock down 2.1, and doing so requires Implementation Reports against the test suite.

> 
> -------
> 
> There are false positive testcases:
> 
> http://test.csswg.org/suites/css2.1/20100917/html4/padding-right-applies-to-013.htm
> 
> is a wrong testcase which all testers (regardless of browser actually
> testing) would/will report as a PASSED test.
> 
> -------
> 
> Some are false negative testcases:
> 
> http://test.csswg.org/suites/css2.1/20100917/html4/vertical-align-115.htm
> 
> http://test.csswg.org/suites/css2.1/20100917/html4/vertical-align-116.htm
> 
> -------
> 
> Some are inaccurately coded testcases. E.g.
> a few (many?) *-applies-to-010 (involving 'display: list-item'). If the tester
> does not see a bullet list-marker, then the testcase should be marked as
> FAILED. The thing is that there are still testcases which do not say that a bullet
> list marker should be visible and are still inappropriately coded which makes
> them hidden (outside the viewport).
> E.g.
> 
> http://test.csswg.org/suites/css2.1/20100917/html4/padding-top-applies-to-010.htm
> 
> ------
> 
> Some testcases are not robust testcases or stringent testcases: eg
> http://test.csswg.org/suites/css2.1/20100917/html4/right-offset-percentage-001.htm
> If you change (or remove) 'position: absolute' and make it static, the testcase
> still passes. If you change 'right: 50%' to 'right: auto', the testcase still passes
> anyway.
> 
> ------
> 
> A good bunch of Microsoft submitted testcases have unnecessary (or
> unjustified or unneeded) declarations (eg height: 0; border-collapse:
> collapse; dir: rtl; position: absolute;) or extraneous div containers.
> It is not a reason to reject them but... it is a reason to believe that such
> testcases are not best and that the test suite could be improved.
> 

Do any of the above comments mean you cannot submit an implementation report?

> ------
> 
> 
> Some are not very relevant testcases: if a testcase is passed when CSS
> support is disabled, then such testcase's relevance is rather limited,
> otherwise questionable. Ideally, you would want all testcases to fail when
> using Lynx 2.8.5 or NS3 or a non-capable CSS browser.
> 
> ------
> 
> Some sections are under-tested (e.g. several sub-sections of section
> 10.3) while some others are IMO over-tested. Did you know that there are
> over 600 testcases testing the zero value (+-signed and unsigned; for the 9
> different units; for many properties).
> 

A lot of the time, browsers have implemented different rounding algorithms for different properties, using floor for one and ceiling for another. Regardless, though, those tests are super fast to run.
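To illustrate the point about per-property rounding differences: two otherwise-reasonable strategies disagree on any fractional computed value. This is an illustration only, not any particular browser's actual algorithm, and the sample value is an assumption:

```python
import math

# Illustration only: how two rounding strategies for a fractional
# computed pixel value can disagree. The value 2.6 is an arbitrary
# example (e.g. a percentage resolving to a fractional px), not taken
# from any real test; no browser is claimed to use either strategy.
computed = 2.6  # computed value in px (assumption)

floor_px = math.floor(computed)  # one property might snap down
ceil_px = math.ceil(computed)    # another might snap up

print(floor_px, ceil_px)  # the two strategies differ by a pixel
```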

> ---------
> 
> 
> My conclusion is that automatable testing, while definitely preferable over
> manual testing, will not do much if the testcases are not reviewed, checked,
> corrected or adjusted accordingly to begin with. You first want to have
> reliable, trustworthy, accurately designed testcases before creating reftests
> or labelling correspondent screenshots.
> 

The ask from the W3C here is to submit an Implementation Report against the current test suite. If there are issues with the tests, they should be submitted, and the working group should evaluate their value-add to the suite and take the appropriate action.

> regards, Gérard
> --
> Contributions to the CSS 2.1 test suite:
> http://www.gtalbot.org/BrowserBugsSection/css21testsuite/
> 
> CSS 2.1 test suite (RC1; September 17th 2010):
> http://test.csswg.org/suites/css2.1/20100917/html4/toc.html
> 
> CSS 2.1 test suite contributors:
> http://test.csswg.org/source/contributors/
> 

Received on Tuesday, 21 September 2010 22:49:59 UTC