- From: Reece Dunn <msclrhd@googlemail.com>
- Date: Wed, 28 Dec 2022 09:49:47 +0000
- To: Michael Kay <mike@saxonica.com>
- Cc: Jirka Kosek <jirka@kosek.cz>, "public-xslt-40@w3.org" <public-xslt-40@w3.org>
- Message-ID: <CAGdtn25uEkwJk0x5YTqbT=vqgZ82jxx9f8B_BC-Kd0aaLkyuQQ@mail.gmail.com>
Hi,

I'm wondering if it would make sense to construct a set of tests that:

a) exercise the different parts of the HTML parser algorithm;
b) exercise the different parts of the HTML tree construction algorithm
   (e.g. the addition of a missing html element);
c) exercise the various HTML entities;
d) exercise the various void elements;
e) cover the various html elements.

Note: For HTML5 support, I've used JSoup (https://github.com/jhy/jsoup) in
various projects.

- Reece

On Tue, 27 Dec 2022 at 23:10, Michael Kay <mike@saxonica.com> wrote:

> To start with, I've cloned the HTML5 test repository and found it has
> 47272 files with suffix HTML.
>
> Giving each of these a signature formed by taking the first three
> characters of each start tag, there are 13787 distinct signatures.
>
> I've divided the files into groups by signature, and I'm taking the first
> file in every tenth group, giving a set of 1378 test files.
>
> I've then tried parsing these files and converting to an XDM using
> saxon:parse-html() on both SaxonJ (using TagSoup) and SaxonCS (using
> HtmlAgilityPack), to establish reference results. However, the results are
> inadequate. Many of the tests take advantage of HTML5 tag omission (e.g.
> omitting the outer `html` tag) and neither of the existing implementations
> can cope with this. So I'm going to have to write a better HTML5->XML
> converter in order to generate the reference results -- and unfortunately,
> at that point, the test will likely become self-fulfilling. But a second
> implementation running the tests should confirm that they're OK.
>
> Michael Kay
> Saxonica
>
> > On 23 Dec 2022, at 22:31, Jirka Kosek <jirka@kosek.cz> wrote:
> >
> > On 22.12.2022 1:06, Michael Kay wrote:
> >> I've just been running a few new tests on our existing parse-html()
> >> function on SaxonJ (built on TagSoup) and SaxonCS (built on
> >> HtmlAgilityPack) and realising how different they are. I suspect that
> >> getting a good level of interoperability (and tests to prove it) for
> >> fn:parse-html is going to be challenging!
> >
> > Hi,
> >
> > I think it would be good to have parsing consistent with web browsers,
> > which means implementing the HTML5 parsing algorithm. I have been using
> > the following parser when I needed to process HTML5 input by XSLT:
> >
> > https://about.validator.nu/htmlparser/
> >
> > Perhaps switching to this parser from TagSoup would give better results,
> > if some other HTML5-compliant parser were used in the .NET product as
> > well.
> >
> > Jirka
> >
> > --
> > ------------------------------------------------------------------
> > Jirka Kosek e-mail: jirka@kosek.cz http://xmlguru.cz
> > ------------------------------------------------------------------
> > Professional XML and Web consulting and training services
> > DocBook/DITA customization, custom XSLT/XSL-FO document processing
> > ------------------------------------------------------------------
> > Bringing you XML Prague conference http://xmlprague.cz
> > ------------------------------------------------------------------
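
For readers following the thread, the sampling procedure described in Michael Kay's quoted message (a signature built from the first three characters of each start tag, files grouped by signature, and the first file taken from every tenth group) might be sketched roughly as below. This is an illustration only, not the script actually used: the names `signature` and `sample` are hypothetical, the regex-based tag detection is a simplification, and it assumes "first three characters of each start tag" refers to the tag name and that groups are ordered by sorting their signatures.

```python
import re
from typing import Dict, List

# Simplified start-tag matcher (an assumption, not a real HTML tokenizer):
# captures the tag name after "<", so "</p>" and "<!--" are skipped.
START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9-]*)")


def signature(html: str) -> str:
    """Concatenate the first three characters of each start-tag name."""
    return "".join(name[:3].lower() for name in START_TAG.findall(html))


def sample(files: Dict[str, str], step: int = 10) -> List[str]:
    """Group file names by signature and return the first file name
    from every `step`-th group (groups ordered by signature)."""
    groups: Dict[str, List[str]] = {}
    for name in sorted(files):  # sorted so "first file" is deterministic
        groups.setdefault(signature(files[name]), []).append(name)
    ordered = sorted(groups)
    return [groups[sig][0] for sig in ordered[::step]]
```

Here `files` is assumed to map a file name to its already-loaded text; reading the 47272 files from the cloned repository is left out of the sketch.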
Received on Wednesday, 28 December 2022 09:50:10 UTC