Re: Testing parse-html from Dimitre Novatchev on 2022-12-23 (public-xslt-40@w3.org from December 2022)

From: Dimitre Novatchev <dnovatchev@gmail.com>
Date: Fri, 23 Dec 2022 10:44:30 -0800
To: Sasha Firsov <suns@firsov.net>
Cc: Michael Kay <mike@saxonica.com>, "public-xslt-40@w3.org" <public-xslt-40@w3.org>
Message-ID: <CAK4KnZeT0i+zq+NwGrWkGLg0VKiGdfmb_RUKi3oTZcD-uh-XMQ@mail.gmail.com>

What about using a web crawler and saving HTML documents that have a
diversity of characteristics that complement each other and that, taken as
a whole, may be regarded as a reasonably complete representation of the
**real** HTML universe?

Also, we could give each HTML document a "frequency weight" that is
larger for documents that are requested more frequently in a given
period of time. Thus, it may be wise not to spend any effort on a document
that accounts for one-millionth of one percent of all requests,
regardless of what "precious" insights this document could give us. To put
it more simply: avoid edge cases whose occurrence is negligible.
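
A minimal sketch of the weighting idea, in Python. Everything named here
(CrawledDoc, frequency_weights, select_corpus, the requests field) is
hypothetical and only illustrates the suggestion; it assumes each URL
appears once and that per-document request counts are available from
somewhere, which the crawl alone would not give us. The default cutoff of
1e-8 is "one-millionth of one percent" written as a fraction.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawledDoc:
    url: str
    requests: int  # hypothetical: request count over the chosen period


def frequency_weights(docs):
    """Map each URL to its share of all requests (shares sum to 1)."""
    total = sum(d.requests for d in docs)
    if total <= 0:
        raise ValueError("need at least one document with requests > 0")
    return {d.url: d.requests / total for d in docs}


def select_corpus(docs, min_share=1e-8, k=None, seed=0):
    """Drop negligible documents, then optionally sample k of the rest.

    Sampling is weighted and without replacement, using
    Efraimidis-Spirakis keys in log form: key = log(u) / weight.
    """
    weights = frequency_weights(docs)
    kept = [d for d in docs if d.requests > 0 and weights[d.url] >= min_share]
    if k is None or k >= len(kept):
        # No sampling needed: return everything, most requested first.
        return sorted(kept, key=lambda d: weights[d.url], reverse=True)
    rng = random.Random(seed)
    # u lies in (0, 1], so log(u) <= 0; heavier documents get keys nearer 0.
    keyed = [(math.log(1.0 - rng.random()) / weights[d.url], d) for d in kept]
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in keyed[:k]]
```

Calling select_corpus(docs) with no k simply filters out the negligible
documents; passing k draws a smaller test set in which frequently requested
documents are more likely to appear.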

BTW, ignoring JavaScript means overlooking a significant portion of the
HTML in browsers -- the HTML that is generated dynamically by the
JavaScript of the initially loaded pages. This creates a significant risk
that the test data will not represent a considerable portion of the HTML
documents that web browsers already have to deal with on a daily basis.

Thanks,
Dimitre

On Fri, Dec 23, 2022 at 9:58 AM Sasha Firsov <suns@firsov.net> wrote:

> > (a) Find a useful set of HTML files (around 1000, ideally)
> There is a "brute force" approach. If you cannot get a set that is
> sufficient for the goal, take a superset that is guaranteed to cover it all.
> Chromium sources have ~90K HTML files, most of which reside in test folders.
> I guess Firefox would have about the same. The old W3C set is still valid
> but not current.
> The union of all 3 would give you the assurance of integrity. Of course it
> is time consuming, but since the final result is needed only during
> release, it is perhaps justified.
> -s
>
> On Thu, Dec 22, 2022 at 3:24 PM Michael Kay <mike@saxonica.com> wrote:
>
>> I guess one approach might be:
>>
>> (a) Find a useful set of HTML files (around 1000, ideally) that exhibit
>> the right range of markup characteristics - remembering that we're not
>> interested in script variations or CSS variations or interactive behaviour,
>> only in markup
>>
>> (b) Put these through a test generator based on Henri Sivonen's HTML5
>> parser, to generate (for each one) an XML document that has the same XDM
>> representation as the HTML.
>>
>> (c) Have the test driver compare the XDM produced by parse-html() on the
>> original document with the XDM produced by parse-xml() on the equivalent
>> XML.
>>
>> I'm still not sure how best to achieve (a).
>>
>> This isn't ideal, because we're testing against a trusted implementation
>> rather than against the specification. And it gets circular if the actual
>> product-under-test is using the same HTML5 parser that was used to
>> construct the tests. But it's a potential way forward.
>>
>> Michael Kay
>> Saxonica
>>
>> On 22 Dec 2022, at 22:23, Sasha Firsov <suns@firsov.net> wrote:
>>
>> Michael,
>> Not a real answer but could cover half of the needs.
>>
>> The test suite has a set of test samples and results to compare against.
>> The second can be achieved by feeding the input string to an actual DOM
>> engine (Chromium/Blink) and comparing your own parser results with the
>> actual browser DOM. Going further, by utilizing *cross-browser testing*
>> capabilities such as those of @web/test-runner-playwright, you would have
>> a *parser browser compatibility matrix*.
>>
>> While the approach is not a test against "ideal" standards, it is more
>> valuable in the web development world as it shows the cross-browser
>> support, a criterion for accepting any JS library. The browsers themselves
>> are not following W3C test suites anymore:
>> > Blink does not currently (4/2013) regularly import and run the W3C's tests
>> <https://www.chromium.org/blink/blink-testing-and-the-w3c/#ideal-state>
>>
>> As for the 1st half of the question on the parser test set, the
>> Chromium(Blink) or FF sources have the parser tests in the sources.
>> Extraction of those is a bit of a challenge though.
>> Blink sources are in chromium repo
>> <https://github.com/chromium/chromium/blob/main/third_party/blink/renderer/core/html/parser>.
>> Do not use `git clone` as it is 30+ GB; get the latest snapshot instead:
>>
>> https://github.com/chromium/chromium/archive/refs/heads/main.zip
>>
>> -s
>>
>> PS I have used such an approach for testing of TEMPLATE
>> <https://github.com/chromium/chromium/blob/main/third_party/blink/web_tests/external/wpt/shadow-dom/slots.html#L8>
>> tag shadowDOM simulation in css-chain
>> <https://github.com/sashafirsov/css-chain-test/blob/main/src/slots-light-vs-shadow.html>
>> and light-dom-element tests.
>>
>> On Wed, Dec 21, 2022 at 4:06 PM Michael Kay <mike@saxonica.com> wrote:
>>
>>> I've just been running a few new tests on our existing parse-html()
>>> function on SaxonJ (built on TagSoup) and SaxonCS (built on
>>> HtmlAgilityPack) and realising how different they are. I suspect that
>>> getting a good level of interoperability (and tests to prove it) for
>>> fn:parse-html is going to be challenging!
>>>
>>> Is there an HTML5 test suite we can build on?
>>>
>>> Michael Kay
>>> Saxonica
>>>
>>
>>

Received on Friday, 23 December 2022 18:44:56 UTC