Re: Testing parse-html

I guess one approach might be:

(a) Find a useful set of HTML files (around 1000, ideally) that exhibit the right range of markup characteristics - remembering that we're not interested in script variations or CSS variations or interactive behaviour, only in markup.

(b) Put these through a test generator based on Henri Sivonen's HTML5 parser, to generate (for each one) an XML document that has the same XDM representation as the HTML.

(c) Have the test driver compare the XDM produced by parse-html() on the original document with the XDM produced by parse-xml() on the equivalent XML.
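
A minimal sketch of the comparison in step (c), written in Python only for brevity. Everything named here is an assumption for illustration: `parse_html_stub` stands in for the product's parse-html() and is built on the standard library's `html.parser`, which is not an HTML5-conformant parser; `canonical` and `same_tree` are hypothetical helpers. The comparison ignores namespaces, comments and processing instructions, which a real XDM comparison would have to cover.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TreeBuilder(HTMLParser):
    """Stand-in for the parser under test; NOT an HTML5-conformant parser."""

    VOID = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("#document")  # hypothetical document node
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Boolean attributes arrive with value None; store them as "".
        elem = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in self.VOID:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def parse_html_stub(html):
    """Hypothetical replacement for the product's parse-html()."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def canonical(elem):
    """Reduce an element to nested tuples so two trees compare with ==."""
    return (
        elem.tag,
        tuple(sorted(elem.attrib.items())),
        elem.text or "",
        tuple((canonical(child), child.tail or "") for child in elem),
    )


def same_tree(html, equivalent_xml):
    """Compare the tree built from the HTML with the tree of the reference XML."""
    actual = parse_html_stub(html)
    expected = ET.fromstring(equivalent_xml)
    return len(actual) == 1 and canonical(actual[0]) == canonical(expected)
```

With this stub, `same_tree('<ul><li>a<li>b</ul>', '<ul><li>a</li><li>b</li></ul>')` reports a mismatch, because the stand-in parser nests the second `li` inside the first instead of closing it implicitly. That is the kind of divergence such a driver is meant to catch.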

I'm still not sure how best to achieve (a).

This isn't ideal, because we're testing against a trusted implementation rather than against the specification. And it gets circular if the actual product-under-test is using the same HTML5 parser that was used to construct the tests. But it's a potential way forward.

Michael Kay
Saxonica

On 22 Dec 2022, at 22:23, Sasha Firsov <suns@firsov.net> wrote:

Michael,
Not a real answer, but this could cover half of the needs.

A test suite has two parts: a set of test samples and the results to compare against. The second can be obtained by feeding the input string to an actual DOM engine (Chromium/Blink) and comparing your own parser's results with the actual browser DOM. Going further, by using cross-browser testing capabilities such as those of @web/test-runner-playwright, you would get a browser compatibility matrix for the parser.

While this approach is not a test against "ideal" standards, it is more valuable in the web development world because it shows cross-browser support, a criterion for accepting any JS library. The browsers themselves are not following W3C test suites anymore:
> Blink does not currently (4/2013) regularly import and run the W3C's tests<https://www.chromium.org/blink/blink-testing-and-the-w3c/#ideal-state>

As for the first half of the question, the parser test set: the Chromium (Blink) and Firefox source trees include parser tests. Extracting them is a bit of a challenge, though.
Blink sources are in chromium repo<https://github.com/chromium/chromium/blob/main/third_party/blink/renderer/core/html/parser>. Do not use `git clone`, as the repository is 30+ GB; download the latest snapshot instead:
https://github.com/chromium/chromium/archive/refs/heads/main.zip
-s

PS I have used this approach for testing the TEMPLATE<https://github.com/chromium/chromium/blob/main/third_party/blink/web_tests/external/wpt/shadow-dom/slots.html#L8> tag shadow DOM simulation in css-chain<https://github.com/sashafirsov/css-chain-test/blob/main/src/slots-light-vs-shadow.html> and in the light-dom-element tests.

On Wed, Dec 21, 2022 at 4:06 PM Michael Kay <mike@saxonica.com> wrote:
I've just been running a few new tests on our existing parse-html() function on SaxonJ (built on TagSoup) and SaxonCS (built on HtmlAgilityPack) and realising how different they are. I suspect that getting a good level of interoperability (and tests to prove it) for fn:parse-html is going to be challenging!

Is there an HTML5 test suite we can build on?

Michael Kay
Saxonica

Received on Thursday, 22 December 2022 23:25:10 UTC