Re: Testing parse-html

> (a) Find a useful set of HTML files (around 1000, ideally)
There is a brute-force approach: if you cannot assemble a set that is
sufficient for the goal, take a superset that is guaranteed to cover it all.
The Chromium sources have ~90K HTML files, most of them in test folders. I
guess Firefox has about the same. The old W3C set is still valid but not
current.
The union of all three would give you assurance of completeness. Of course it
is time-consuming, but since the final result is needed only at release
time, it is perhaps justified.
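
A minimal sketch of building that union, assuming the three source trees
have already been unpacked locally. The function name and the directory
names in the usage note are hypothetical. Files are de-duplicated by content
hash, since the same test page is often copied between projects.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable

# Assumption: markup test files are identified by extension only.
HTML_SUFFIXES = {".html", ".htm"}


def collect_html_corpus(roots: Iterable[Path]) -> Dict[str, Path]:
    """Return one representative path per distinct file content.

    Keys are SHA-256 digests of the file bytes; the first path seen
    for a given digest is kept, later byte-identical copies are skipped.
    """
    corpus: Dict[str, Path] = {}
    for root in roots:
        # sorted() makes the choice of representative deterministic
        for path in sorted(Path(root).rglob("*")):
            if path.is_file() and path.suffix.lower() in HTML_SUFFIXES:
                digest = hashlib.sha256(path.read_bytes()).hexdigest()
                corpus.setdefault(digest, path)
    return corpus
```

Called as, say, collect_html_corpus([Path("chromium"), Path("firefox"),
Path("w3c")]), it returns the de-duplicated union; len() of the result is the
number of distinct documents the test generator would have to process.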
-s

On Thu, Dec 22, 2022 at 3:24 PM Michael Kay <mike@saxonica.com> wrote:

> I guess one approach might be:
>
> (a) Find a useful set of HTML files (around 1000, ideally) that exhibit
> the right range of markup characteristics - remembering that we're not
> interested in script variations or CSS variations or interactive behaviour,
> only in markup
>
> (b) Put these through a test generator based on Henri Sivonen's HTML5
> parser, to generate (for each one) an XML document that has the same XDM
> representation as the HTML.
>
> (c) Have the test driver compare the XDM produced by parse-html() on the
> original document with the XDM produced by parse-xml() on the equivalent
> XML.
>
> I'm still not sure how best to achieve (a).
>
> This isn't ideal, because we're testing against a trusted implementation
> rather than against the specification. And it gets circular if the actual
> product-under-test is using the same HTML5 parser that was used to
> construct the tests. But it's a potential way forward.
>
> Michael Kay
> Saxonica
>
> On 22 Dec 2022, at 22:23, Sasha Firsov <suns@firsov.net> wrote:
>
> Michael,
> Not a real answer but could cover half of the needs.
>
> A test suite needs a set of test samples and expected results to compare
> against. The latter can be obtained by feeding the input string to an actual
> DOM engine (Chromium/Blink) and comparing your own parser's results with the
> actual browser DOM. Going further, by using *cross-browser testing*
> capabilities such as those of @web/test-runner-playwright, you would get a
> *parser browser compatibility matrix*.
>
> While this approach is not a test against an "ideal" standard, it is more
> valuable in the web development world because it shows cross-browser
> support, a criterion for accepting any JS library. The browsers themselves
> no longer follow the W3C test suites:
> > Blink does not currently (4/2013) regularly import and run the W3C's
> > tests
> <https://www.chromium.org/blink/blink-testing-and-the-w3c/#ideal-state>
>
> As for the first half of the question, the parser test set: the Chromium
> (Blink) and Firefox sources include parser tests.
> Extracting them is a bit of a challenge, though.
> The Blink sources are in the chromium repo
> <https://github.com/chromium/chromium/blob/main/third_party/blink/renderer/core/html/parser>.
> Do not use `git clone`, as the repository is 30+ GB; download the latest
> snapshot instead:
>
> https://github.com/chromium/chromium/archive/refs/heads/main.zip
>
> -s
>
> PS I have used this approach to test the TEMPLATE
> <https://github.com/chromium/chromium/blob/main/third_party/blink/web_tests/external/wpt/shadow-dom/slots.html#L8>
> tag shadow DOM simulation in css-chain
> <https://github.com/sashafirsov/css-chain-test/blob/main/src/slots-light-vs-shadow.html>
> and in the light-dom-element tests.
>
> On Wed, Dec 21, 2022 at 4:06 PM Michael Kay <mike@saxonica.com> wrote:
>
>> I've just been running a few new tests on our existing parse-html()
>> function on SaxonJ (built on TagSoup) and SaxonCS (built on
>> HtmlAgilityPack) and realising how different they are. I suspect that
>> getting a good level of interoperability (and tests to prove it) for
>> fn:parse-html is going to be challenging!
>>
>> Is there an HTML5 test suite we can build on?
>>
>> Michael Kay
>> Saxonica
>>
>
>

Received on Friday, 23 December 2022 17:58:16 UTC