Re: Parsing and lookahead

John Dziurlaj <john@turnout.rocks> writes:
> I am trying to parse the definitions section of the SGML specification. Each definition starts with a clause number (e.g. 4.2), and can run across multiple lines. I can handle cases where a given definition is contained on a single line. However, when the number of lines varies, I am lost as to what to do.

Is it the case that each clause begins with a clause number, like this:

  3.14 This is the start of a clause

  It can have lots of stuff in it

  NOTE maybe a note

  3.15 This is the next clause…

And you want to capture everything in each clause? Or is there more variation in the data?
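
If the data really is that regular, the shape I have in mind can be sketched outside the grammar. This is only an illustration in Python, not a parsing solution: it assumes a clause number is digits, a dot, and digits at the start of a line, and the names `CLAUSE_START` and `split_clauses` are mine, not anything from the SGML specification or your grammar.

```python
import re

# Assumption: a clause number is digits, ".", digits at the start of a line,
# followed by whitespace (e.g. "3.14 "). Deeper numbering such as "4.2.1"
# would need a wider pattern.
CLAUSE_START = re.compile(r"^(\d+\.\d+)\s+(.*)$")

def split_clauses(text):
    """Group lines into (number, body) pairs; a clause number starts a new group."""
    clauses = []
    for line in text.splitlines():
        m = CLAUSE_START.match(line)
        if m:
            # A new clause number always opens a new clause.
            clauses.append((m.group(1), [m.group(2)]))
        elif clauses:
            # Anything else (blank lines, NOTEs) belongs to the current clause.
            clauses[-1][1].append(line)
        # Lines before the first clause number are dropped.
    return [(num, "\n".join(body).strip()) for num, body in clauses]

sample = """3.14 This is the start of a clause

It can have lots of stuff in it

NOTE maybe a note

3.15 This is the next clause"""

for number, body in split_clauses(sample):
    print(number, repr(body))
```

With the sample above this yields two clauses, 3.14 (including the NOTE) and 3.15. If your data doesn't split cleanly this way, that would tell us where the extra variation is.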

>  description: ~[#a;#d]+, ~["0"-"9"]. 

This says a description is “one or more characters that aren’t #a (line feed) or #d (carriage return), followed by a single character that isn’t 0-9”. Is that an attempt to exclude the next clause number? As written it can never span more than one line, and given the presence of NOTEs, I’m not sure that’s going to be sufficient.
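
To make that reading concrete, here is a rough regular-expression analogue in Python. It approximates the rule as I read it and is not a substitute for running the grammar; the names `DESCRIPTION` and `matches` are just for illustration.

```python
import re

# Rough analogue of: description: ~[#a;#d]+, ~["0"-"9"].
#   [^\n\r]+  one or more characters that are not LF (#a) or CR (#d)
#   [^0-9]    exactly one character that is not a digit (a line end qualifies)
DESCRIPTION = re.compile(r"[^\n\r]+[^0-9]")

def matches(text):
    # fullmatch: the whole string has to fit the pattern, nothing left over
    return DESCRIPTION.fullmatch(text) is not None

print(matches("A definition on one line."))   # True
print(matches("First line\nsecond line"))     # False: cannot continue past the line end
print(matches("NOTE maybe a note\n"))         # True: a NOTE line looks like any other
print(matches("ends in 7"))                   # False: last character is a digit
```

The second and third cases are the ones that worry me: the rule stops at the first line end, and nothing in it distinguishes a NOTE from the body of a definition.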

> A line feed cannot be used to determine when a new definition begins; however, AFAIK there is no lookahead ability to check for the existence of a new clause (which always indicates a new definition).

Indeed, there’s no lookahead. You can’t peek forward without consuming.

                                        Be seeing you,
                                          norm

--
Norm Tovey-Walsh
Saxonica

Received on Sunday, 4 May 2025 14:50:28 UTC