- From: Norm Tovey-Walsh <norm@saxonica.com>
- Date: Mon, 05 Sep 2022 08:17:30 +0100
- To: ixml <public-ixml@w3.org>
- Message-ID: <m21qsqi5j8.fsf@saxonica.com>
Hello world, Yesterday, I published a new experimental CoffeePot release with support for a feature to parse an input document as a set of records. It’s described in the documentation[1] and I wrote a post about it[2]. (Executive summary: you provide a regex for the start- or end-of-record boundary.) The idea is that a file, like UnicodeData.txt (34,626 lines, ~1.9M), is logically a set of records. Rather than writing a grammar to parse the whole file, I can write a grammar to parse each record and call the parser once for each record. Given an iXML grammar for all of the file, if I let my Earley parser run on UnicodeData.txt for about five minutes, it says that it’s parsed 6.1% of the file and will take another 1h14m to finish. The GLL parser says it’s parsed 7.5% and will take another 1h2m to finish. They’re both lying about how much longer the parse will take. That figure is still getting bigger in both cases. Using the record-oriented approach, the Earley parser parses the whole file in 28s, the GLL parser in 45s. The result is ~9.8M and two lines longer because of the outermost “records” element that wraps all the records. It surprises me slightly that the GLL parser is slower, I suspect I’m rebuilding the executable “bytecode” for each GLL parse or something. Given a subset of the file about 1,500 lines long, the GLL parser does the whole thing in 158s where the Earley parser takes 258s. A mystery for another day. Be seeing you, norm [1] https://coffeepot.nineml.org/ [2] https://so.nwalsh.com/2022/09/04/records -- Norm Tovey-Walsh Saxonica
Received on Monday, 5 September 2022 08:34:36 UTC