
RE: Conversion of MS CSS 2.1 tests to reftests

From: Gérard Talbot <css21testsuite@gtalbot.org>
Date: Wed, 22 Sep 2010 08:24:53 -0700
Message-ID: <ab06f5ce84e0259707abc4b9d50f5223.squirrel@cp3.shieldhost.com>
To: "John Jansen" <John.Jansen@microsoft.com>
Cc: "public-css-testsuite@w3.org" <public-css-testsuite@w3.org>
>> -----Original Message-----
>> From: "Gérard Talbot" [mailto:css21testsuite@gtalbot.org]
>> Sent: Tuesday, September 21, 2010 1:33 PM
>> To: public-css-testsuite@w3.org
>> Cc: John Jansen
>> Subject: RE: Conversion of MS CSS 2.1 tests to reftests
>> Hello John,
>> > Please note, that I ran the entire suite for the first time last
>> > summer
>>
>> Last summer? You mean this summer 2010.. or summer 2009? There is a
>> difference of several thousand testcases if we're talking summer 2009
>> versus summer 2010 here.
> Summer 2010 using an IE9 build.
>> > and it took me three days of interrupted time (NOT non-interrupted
>> > time).
>>
>> 3 days to run how many testcases? How many seconds (avg) per testcase?
> Well, ~9500. The tests that are there. I think that works out to just
> under 8 seconds a test overall if we say I worked 7 hours a day (seems
> about right: Thursday, Friday, Saturday).
>> > I just now ran 20 tests from the HTML suite and it took me 24
>> > seconds.
>>
>> I have huge difficulties understanding how you can run 20 testcases
>> manually in 24 seconds.
> I'm not responding to this comment, as I am not lying.
>> In order to run, say, 20 testcases, you need to do at minimum 39
>> (2n - 1) mouseclicks. The testcases are not arranged, not coded with
>> <link rel="next">, so you have to click the Back button to get back
>> to the list of testcases and then click the link of the next
>> testcase. And you have to read the pass/fail conditions of each
>> testcase too.
> I have no idea why you need to browse back to the test case when you
> can easily download the zipped files and run them, or build a simple
> harness in script to help you. You could write an htm file that is
> generated by a script doing a for-each over the files in the unzipped
> folder. I personally just had two monitors up and loaded the tests
> from the folder on one monitor into the browser on the other.
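For what it's worth, such a generated index can be sketched in a few lines of Python. The folder and file names below are assumptions made for this demo, not part of any official harness:

```python
# Sketch of John's "for-each file" idea: a script that generates one
# index.htm linking every testcase in the unzipped folder, so you never
# need the Back button. Folder and file names here are demo assumptions.
import os
import html

folder = "testsuite"

# Demo setup only: stand-ins for the unzipped test suite files.
os.makedirs(folder, exist_ok=True)
for demo in ("demo-test-001.htm", "demo-test-002.htm"):
    with open(os.path.join(folder, demo), "w") as f:
        f.write("<!DOCTYPE html><p>placeholder testcase</p>")

# One link per .htm/.html file; each test loads into a named frame so
# the list of testcases stays visible alongside it.
items = []
for name in sorted(os.listdir(folder)):
    if name.endswith((".htm", ".html")):
        items.append('<li><a href="%s/%s" target="view">%s</a></li>'
                     % (folder, name, html.escape(name)))

page = ('<!DOCTYPE html><title>CSS 2.1 testcase index</title>'
        '<ul>%s</ul><iframe name="view" width="100%%" height="600">'
        '</iframe>' % "\n".join(items))

with open("index.htm", "w") as f:
    f.write(page)

print("indexed %d testcases" % len(items))  # prints "indexed 2 testcases"
```

Open the generated index.htm in the browser under test and each click loads the next testcase into the frame: one click per testcase instead of 2n - 1.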
>> No testcase is http-prefetchable (coded with <link rel="prefetch">).
>> There are many testcases which require comparing with a reference
>> test. So, at least, 2 extra clicks.
> None of the 20 I ran needed me to do any additional clicks.
>> > I am not saying this is typical, necessarily, and when you hit a
>> > failure it certainly adds time, but I think that an 11 second
>> > average seems very high in practice.
>> > -John
>> I took the test harness in January 2010 (511 tests) and I mentioned
>> this in
>> http://lists.w3.org/Archives/Public/public-css-testsuite/2010Jan/0043.html
>> and my results (I was using Konqueror 4.x) are still available and
>> viewable here:
>> http://www.w3.org/2008/07/test-harness-css/results.php?s=htm4&o=0
>> and it took me 4 hours to run the 511 tests.
> That is very surprising. I suspect you were evaluating each test for
> accuracy as you ran them,


Hello John,

I must have checked/examined a few closely, yes. I know for a fact that
I could not say either pass or fail for 30 testcases. Those 30
testcases are all identified in
http://www.w3.org/2008/07/test-harness-css/results.php?s=htm4&o=0
(I was using Konqueror) and I listed the font-related ones in
http://lists.w3.org/Archives/Public/public-css-testsuite/2010Jan/0043.html


> rather than simply logging a pass/fail.

The logging of a pass/fail result was done with the buttons at the
bottom of the test harness page. Those pass/fail results were logged
and are in the aforementioned page:
http://www.w3.org/2008/07/test-harness-css/results.php?s=htm4&o=0

> You are saying on average 28 seconds per test; I have huge
> difficulties understanding that number.
>> There are other testcase situations which will slow down testers:
>> - a bunch of testcases require downloading and installing a custom
>> font and then uninstalling it
> Yep, did it.
>> - a bunch of testcases require downloading and installing a user
>> style sheet
>> - a bunch of testcases require reading more than 1 sentence
>> - a bunch of testcases have small or very small lines or squares as
>> pass/fail conditions
> Yes, they do.
>> - a bunch of testcases have awkward wording of pass/fail conditions
>> or inappropriate shape descriptions of the expected result (causing
>> confusion, hesitation)
>> - a good bunch of testcases require comparing the width or height of
>> 2 squares.
>> If quality (over speed) of testing is more important, if testers have
>> more than an "it's good enough" sense of quality/QA policy, then they
>> may report a few more FAILED tests after stopping and spending a few
>> more seconds. E.g. in
>> http://test.csswg.org/suites/css2.1/20100917/html4/html-attribute-019.htm
>> "there is no space between the green and blue boxes" is not the same
>> as the green square partially *overlapping* the blue square.
> Yep, I actually had a tri-state approach: Pass/Fail/???. I went
> through the whole suite. After I was done, I went back to the harder
> ones to evaluate, took my time with them, and for any final questions,
> I met with Arron to discuss.
>> ==============
>> "I do not think it is worth it to try rushing to REC while the test
>> suite is in the state it is in."
>> Anne van Kesteren
>>
>> I very much agree with Anne van Kesteren's opinion here.
> I'm not a fan of rushing to REC either. I have no idea why it's
> September 21st and it seems like very few people have been running the
> tests that have been up there for months if not years.


The first inclusion of testcases from Microsoft was, according to the
IE blog, on March 6th 2008, and it was a batch of 700 testcases. The
biggest batch was 3784 testcases, on January 27th 2009, also according
to the IE blog. So, many months: yes. Many years: I would not say so.



> We have had a plan in place, we discussed it in January, at the
> spring F2F, and then got concrete agreement in Oslo. October 15th was
> the agreed upon date.

<shrug> I cannot speak about this Oslo F2F meeting agreement.


>> -------
>> There are wrong testcases in the test suite; not many... hopefully.
>> E.g.:
>> http://test.csswg.org/suites/css2.1/20100917/html4/position-relative-nested-001.htm
> Yep, those issues will be revealed as people continue to review the
> suite, and should be raised as issues. Like any process for locking
> down, you evaluate the incoming feedback as it comes. Locking down
> means reducing churn. I think we all want to lock down 2.1, and doing
> so requires Implementation Reports against the test suite.


The false Failed results will be detected, discussed and fixed rather
soon IMO: an excellent example of this is David Baron's first 6 emails
about specific testcases. His opinion was that the 6 testcases
mentioned in his emails were false Failed results.

The false Passed results, and the outright wrong testcases, are quite
different. If I may use an analogy: it will be rather easy and fast to
point out the oranges in this big bag of apples, but it will be
considerably more difficult to isolate and identify the bad apples,
the apples with a worm inside, the partially rotten ones. You'll need
to taste them a bit, or dissect them a bit.


>> -------
>> There are false positive testcases:
>> http://test.csswg.org/suites/css2.1/20100917/html4/padding-right-applies-to-013.htm
>> is a wrong testcase which all testers (regardless of the browser
>> actually being tested) would/will report as a PASSED test.
>> -------
>> Some are false negative testcases:
>> http://test.csswg.org/suites/css2.1/20100917/html4/vertical-align-115.htm
>> http://test.csswg.org/suites/css2.1/20100917/html4/vertical-align-116.htm
>> -------
>> Some are inaccurately coded testcases, e.g. a few (many?)
>> *-applies-to-010 testcases (involving 'display: list-item'). If the
>> tester does not see a bullet list marker, then the testcase should be
>> marked as FAILED. The thing is that there are still testcases which
>> do not say that a bullet list marker should be visible, and which are
>> still inappropriately coded in a way that hides the marker (outside
>> the viewport). E.g.
>> http://test.csswg.org/suites/css2.1/20100917/html4/padding-top-applies-to-010.htm
>> ------
>> Some testcases are not robust or stringent: e.g.
>> http://test.csswg.org/suites/css2.1/20100917/html4/right-offset-percentage-001.htm
>> If you change (or remove) 'position: absolute' and make the element
>> static, the testcase still passes. If you change 'right: 50%' to
>> 'right: auto', the testcase still passes anyway.
>> ------
>> A good bunch of Microsoft-submitted testcases have unnecessary (or
>> unjustified or unneeded) declarations (e.g. height: 0;
>> border-collapse: collapse; dir: rtl; position: absolute;) or
>> extraneous div containers. That is not a reason to reject them,
>> but... it is a reason to believe that such testcases are not at
>> their best and that the test suite could be improved.
> Do any of the above comments mean you cannot submit an implementation
> report?


No. None of the above comments mean I cannot submit an implementation
report. But those comments do justify not having complete blind faith
in the Implementation Report results, or blind trust in all of the
testcases that pass.

Submitting an implementation report does not in any way improve the
quality, trustworthiness or reliability of any testcase. You can take
a car out for an impromptu road test; it does not mean the car
mechanic thinks it's a good idea.. he may be pulling his hair out with
anxiety.

At some point, you will have to review the testcases and check their
intrinsic quality, robustness, accuracy, reliability, etc. Ideally,
you would want to review all of the testcases before submitting an
implementation report.. wouldn't you?


>> ------
>> Some are not very relevant testcases: if a testcase passes when CSS
>> support is disabled, then such a testcase's relevance is rather
>> limited, if not questionable. Ideally, you would want all testcases
>> to fail when using Lynx 2.8.5 or NS3 or a non-CSS-capable browser.
>> ------
>> Some sections are under-tested (e.g. several sub-sections of section
>> 10.3) while some others are IMO over-tested. Did you know that there
>> are over 600 testcases testing the zero value (+/- signed and
>> unsigned; for the 9 different units; for many properties)?
> A lot of times browsers have implemented different rounding
> algorithms for different properties, using floor for one and ceiling
> for another. Regardless, though, those tests are superfast to run.
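As an aside, the floor/ceiling point is easy to illustrate with a toy example (the 13.3px/em figure below is invented, not taken from any actual browser):

```python
# Two hypothetical browsers rounding the same computed length: one
# floors, one ceilings, so they disagree on any fractional value...
import math

computed_px = 0.3 * 13.3        # 0.3em at an invented 13.3px/em

print(math.floor(computed_px))  # prints 3
print(math.ceil(computed_px))   # prints 4

# ...but zero is the one value every rounding scheme must agree on,
# whatever the unit or sign. So per-property zero tests are cheap
# sanity checks rather than purely redundant ones.
for zero in (0.0, -0.0):
    assert math.floor(zero) == math.ceil(zero) == 0
```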
>> ---------
>> My conclusion is that automatable testing, while definitely
>> preferable to manual testing, will not do much if the testcases are
>> not reviewed, checked, corrected or adjusted accordingly to begin
>> with. You first want reliable, trustworthy, accurately designed
>> testcases before creating reftests or labelling corresponding
>> screenshots.
> The ask from the W3C here is to submit an Implementation Report
> against the current test suite. If there are issues with the tests
> that need to be submitted, then they should be submitted


There are issues with some tests.

I have submitted the issues I found.

In some cases, twice in the mailing list and before RC1.


> and the working group should evaluate them on their value-add to the
> suite, and then the appropriate action should be taken.


I really do not mind my reports being evaluated and scrutinized as
well. I have no problem with such protocols or reciprocity.

regards, Gérard
-- 
Contributions to the CSS 2.1 test suite:
http://www.gtalbot.org/BrowserBugsSection/css21testsuite/

CSS 2.1 test suite (RC1; September 17th 2010):
http://test.csswg.org/suites/css2.1/20100917/html4/toc.html

CSS 2.1 test suite contributors:
http://test.csswg.org/source/contributors/
Received on Wednesday, 22 September 2010 15:26:02 GMT
