Re: Testing parse-html from Reece Dunn on 2022-12-28 (public-xslt-40@w3.org from December 2022)

From: Reece Dunn <msclrhd@googlemail.com>
Date: Wed, 28 Dec 2022 09:49:47 +0000
To: Michael Kay <mike@saxonica.com>
Cc: Jirka Kosek <jirka@kosek.cz>, "public-xslt-40@w3.org" <public-xslt-40@w3.org>
Message-ID: <CAGdtn25uEkwJk0x5YTqbT=vqgZ82jxx9f8B_BC-Kd0aaLkyuQQ@mail.gmail.com>

Hi,

I'm wondering if it would make sense to construct a set of tests that:
a) exercise the different parts of the HTML parser algorithm;
b) exercise the different parts of the HTML tree construction algorithm
(e.g. the addition of a missing html element);
c) exercise the various HTML entities;
d) exercise the various void elements;
e) cover the various HTML elements.
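
As a rough illustration of items (c) and (d), the inputs for such tests could be generated rather than written by hand. The sketch below is only an assumption about how that might look: the function names and the test-name scheme are hypothetical, the void element list is the one I believe the HTML Living Standard gives, and the entity table comes from Python's standard `html.entities.html5` mapping. Expected trees are deliberately left out, since they would have to come from a conforming parser.

```python
from html.entities import html5  # HTML5 named character references, e.g. "amp;" -> "&"

# Void elements (assumed to match the list in the HTML Living Standard).
VOID_ELEMENTS = ["area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"]


def entity_cases():
    """Yield (test name, HTML input, expected text) for each named entity."""
    for name, expansion in sorted(html5.items()):
        if name.endswith(";"):  # skip the legacy forms written without a semicolon
            yield f"entity-{name[:-1]}", f"<p>&{name}</p>", expansion


def void_element_cases():
    """Yield (test name, HTML input): a void element followed by text."""
    for tag in VOID_ELEMENTS:
        # No end tag is written; where "after" lands in the tree depends on the
        # tree construction rules, so the expected result is not stated here.
        yield f"void-{tag}", f"<{tag}>after"
```

Each generated input would then be passed to the parse-html implementation under test and compared against a reference tree.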

Note: For HTML5 support, I've used JSoup (https://github.com/jhy/jsoup) in
various projects.

- Reece

On Tue, 27 Dec 2022 at 23:10, Michael Kay <mike@saxonica.com> wrote:

> To start with, I've cloned the HTML5 test repository and found it has
> 47272 files with suffix HTML.
>
> Giving each of these a signature formed by taking the first three
> characters of each start tag, there are 13787 distinct signatures.
>
> I've divided the files into groups by signature, and I'm taking the first
> file in every tenth group, giving a set of 1378 test files.
>
> I've then tried parsing these files and converting to an XDM using
> saxon:parse-html() on both SaxonJ (using TagSoup) and SaxonCS (using
> HtmlAgilityPack), to establish reference results. However, the results are
> inadequate. Many of the tests take advantage of HTML5 tag omission (e.g.
> omitting the outer `html` tag) and neither of the existing implementations
> can cope with this. So I'm going to have to write a better HTML5->XML
> converter in order to generate the reference results -- and unfortunately,
> at that point, the test will likely become self-fulfilling. But a second
> implementation running the tests should confirm that they're OK.
>
> Michael Kay
> Saxonica
>
> > On 23 Dec 2022, at 22:31, Jirka Kosek <jirka@kosek.cz> wrote:
> >
> > On 22.12.2022 1:06, Michael Kay wrote:
> >> I've just been running a few new tests on our existing parse-html()
> function on SaxonJ (built on TagSoup) and SaxonCS (built on
> HtmlAgilityPack) and realising how different they are. I suspect that
> getting a good level of interoperability (and tests to prove it) for
> fn:parse-html is going to be challenging!
> > Hi,
> >
> > I think it would be good to have parsing consistent with web browsers,
> which means implementing the HTML5 parsing algorithm. I have been using the
> following parser when I needed to process HTML5 input with XSLT:
> >
> > https://about.validator.nu/htmlparser/
> >
> > Perhaps switching from TagSoup to this parser would give better results,
> if an HTML5-compliant parser were also used in the .NET product.
> >
> >                               Jirka
> >
> > --
> > ------------------------------------------------------------------
> >  Jirka Kosek      e-mail: jirka@kosek.cz      http://xmlguru.cz
> > ------------------------------------------------------------------
> >     Professional XML and Web consulting and training services
> > DocBook/DITA customization, custom XSLT/XSL-FO document processing
> > ------------------------------------------------------------------
> >    Bringing you XML Prague conference    http://xmlprague.cz
> > ------------------------------------------------------------------
>
>
>
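
The file-sampling procedure described in Michael Kay's quoted message (a signature built from the first three characters of each start tag, then one file from every tenth signature group) could be sketched roughly as follows. This is a guess at the approach, not the script actually used: the regular expression is a crude stand-in for a real tokenizer, "first three characters" is assumed to mean the first three characters of the tag name, and the names `signature` and `sample` are hypothetical.

```python
import re
from collections import defaultdict

# Crude start-tag matcher: "<" followed directly by a tag name.
# End tags ("</p>") and "<!DOCTYPE ...>" do not match.
START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)")


def signature(html_text):
    """Concatenate the first three characters of each start tag name."""
    return "".join(m.group(1)[:3].lower() for m in START_TAG.finditer(html_text))


def sample(files, step=10):
    """Group (name, text) pairs by signature and return the first file
    of every step-th group, in order of first appearance."""
    groups = defaultdict(list)
    for name, text in files:
        groups[signature(text)].append(name)
    # Taking groups 10, 20, 30, ... gives 1378 picks from 13787 groups,
    # which matches the count quoted in the message.
    return [names[0] for names in list(groups.values())[step - 1::step]]
```

In practice `files` would be built by reading each `.html` file from the cloned test repository; the exact choice of which group in each ten to keep is an assumption.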

Received on Wednesday, 28 December 2022 09:50:10 UTC