Re: Testing parse-html

To start with, I've cloned the HTML5 test repository and found it has 47272 files with suffix HTML.

Giving each of these a signature formed by taking the first three characters of each start tag, there are 13787 distinct signatures.

I've divided the files into groups by signature, and I'm taking the first file in every tenth group, giving a set of 1378 test files.
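
For illustration, the sampling could be sketched roughly as follows in Python. This is a sketch under stated assumptions, not the script actually used. It assumes "the first three characters of each start tag" means the first three characters of the tag name, that names are compared case-insensitively, and that groups are ordered by sorted signature. The names `signature` and `sample_by_signature` are hypothetical. The regular expression only approximates start-tag recognition: it would also match tag-like text inside comments or attribute values.

```python
import re
from collections import defaultdict

# Approximate start-tag matcher: "<" immediately followed by a letter.
# End tags (</p>), comments (<!-- -->) and doctypes (<!DOCTYPE>) do not match.
START_TAG = re.compile(r"<([A-Za-z][^\s/>]*)")


def signature(html_text):
    """Space-joined first three characters of every start-tag name, lowercased."""
    return " ".join(m.group(1)[:3].lower() for m in START_TAG.finditer(html_text))


def sample_by_signature(files, step=10):
    """Group (name, html_text) pairs by signature; return the first file
    of every step-th group, counting groups in sorted-signature order."""
    groups = defaultdict(list)
    for name, text in files:
        groups[signature(text)].append(name)
    # Assumption: sorted order, chosen only to make the result deterministic.
    ordered = sorted(groups)
    # Start at the step-th group, so n groups yield n // step files.
    return [groups[sig][0] for sig in ordered[step - 1::step]]
```

Starting at the tenth group, 13787 groups yield 13787 // 10 = 1378 files, which agrees with the count above.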

I've then tried parsing these files and converting to an XDM using saxon:parse-html() on both SaxonJ (using TagSoup) and SaxonCS (using HtmlAgilityPack), to establish reference results. However, the results are inadequate. Many of the tests take advantage of HTML5 tag omission (e.g. omitting the outer `html` tag) and neither of the existing implementations can cope with this. So I'm going to have to write a better HTML5->XML converter in order to generate the reference results -- and unfortunately, at that point, the test will likely become self-fulfilling. But a second implementation running the tests should confirm that they're OK.

Michael Kay
Saxonica

> On 23 Dec 2022, at 22:31, Jirka Kosek <jirka@kosek.cz> wrote:
> 
> On 22.12.2022 1:06, Michael Kay wrote:
>> I've just been running a few new tests on our existing parse-html() function on SaxonJ (built on TagSoup) and SaxonCS (built on HtmlAgilityPack) and realising how different they are. I suspect that getting a good level of interoperability (and tests to prove it) for fn:parse-html is going to be challenging!
> Hi,
> 
> I think it would be good to have parsing consistent with web browsers, which means implementing the HTML5 parsing algorithm. I have been using the following parser when I needed to process HTML5 input with XSLT:
> 
> https://about.validator.nu/htmlparser/
> 
> Perhaps switching from TagSoup to this parser would give better results, provided that some other HTML5-compliant parser is used in the .NET product as well.
> 
> 				Jirka
> 
> -- 
> ------------------------------------------------------------------
>  Jirka Kosek      e-mail: jirka@kosek.cz      http://xmlguru.cz
> ------------------------------------------------------------------
>     Professional XML and Web consulting and training services
> DocBook/DITA customization, custom XSLT/XSL-FO document processing
> ------------------------------------------------------------------
>    Bringing you XML Prague conference    http://xmlprague.cz
> ------------------------------------------------------------------

Received on Tuesday, 27 December 2022 23:10:31 UTC