Parsing “records”

Hello world,

Yesterday, I published a new experimental CoffeePot release with support
for a feature to parse an input document as a set of records. It’s
described in the documentation[1] and I wrote a post about it[2].
(Executive summary: you provide a regex for the start- or end-of-record
boundary.)

The idea is that a file, like UnicodeData.txt (34,626 lines, ~1.9M), is
logically a set of records. Rather than writing a grammar to parse the
whole file, I can write a grammar to parse each record and call the
parser once for each record.

Given an iXML grammar for all of the file, if I let my Earley parser run
on UnicodeData.txt for about five minutes, it says that it’s parsed 6.1%
of the file and will take another 1h14m to finish. The GLL parser says
it’s parsed 7.5% and will take another 1h2m to finish.

They’re both lying about how much longer the parse will take. That
figure is still getting bigger in both cases.

Using the record-oriented approach, the Earley parser parses the whole
file in 28s, the GLL parser in 45s. The result is ~9.8M and two lines
longer because of the outermost “records” element that wraps all the
records.

It surprises me slightly that the GLL parser is slower, I suspect I’m
rebuilding the executable “bytecode” for each GLL parse or something.
Given a subset of the file about 1,500 lines long, the GLL parser does
the whole thing in 158s where the Earley parser takes 258s. A mystery
for another day.

                                        Be seeing you,
                                          norm

[1] https://coffeepot.nineml.org/
[2] https://so.nwalsh.com/2022/09/04/records


--
Norm Tovey-Walsh
Saxonica

Received on Monday, 5 September 2022 08:34:36 UTC