Re: Gedcom example from C. M. Sperberg-McQueen on 2022-08-03 (public-ixml@w3.org from August 2022)

From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
Date: Wed, 03 Aug 2022 16:54:31 -0600
To: Norm Tovey-Walsh <norm@saxonica.com>
Cc: Steven Pemberton <steven.pemberton@cwi.nl>, public-ixml@w3.org
Message-ID: <87les4c2fz.fsf@blackmesatech.com>
Norm Tovey-Walsh <norm@saxonica.com> writes:

> [[PGP Signed Part:Undecided]]
> Steven Pemberton <steven.pemberton@cwi.nl> writes:
>> Hopefully I'm more awake now, and have got it right. The Gedcom
>> version, with links, with nesting.
>
> I think it’s important to observe that this doesn’t solve the problem as
> presented, it solves a subset of the problem. It can’t handle nesting
> more than 9 nevels and I don’t think ixml can handle arbitrarily deep
> nesting. (Though as Steve pointed out, you could scan for the level and
> then generate a grammar that had that many levels.)

I may have understood the presentation of the problem differently; I
thought the question was along the lines of "can you do anything useful
with this?"

Clearly writing an ixml grammar that accepts all and only valid Gedcom
data counts as doing something useful.  But I think other things also
count as useful, and I think Steven's grammar shows one helpful
approach.  It accepts all and only files where the level numbers follow
the rules (a datum with the level number L+1 is subordinate to and
logically enclosed by the most recent datum with level number L, and the
level number must increase by at most 1) and the level numbers are all
between 0 and 9 inclusive.  (Like Steven's sample solution, I will
ignore here the Gedcom rules for continuation lines, escapes,
character-set shifts and other complications in order to focus on the
questions raised by the level numbers.)

So I think Steven's solution counts as a successful response to Mike's
challenge.

It does, after all, support nesting more than twice as deep as is found
in the sample input.  It seems likely to handle a useful range of Gedcom
input.

A second useful grammar one could write would accept all Gedcom data
with correct level numbers, and reject some but not all Gedcom data with
ungrammatical level numbers.

gedcom: record*.
record: -"0 ", -field, f1*.
f1: -"1 ", field, f2*.
f2: -"2 ", field, f3*.
f3: -"3 ", field, f4*.
f4: -"4 ", field, f5*.
f5: -"5 ", field, f6*.
f6: -"6 ", field, f7*.
f7: -"7 ", field, f8*.
f8: -"8 ", field, f9*.
f9: -"9 ", field, fn0*.

fn0: tens, -"0 ", field, fn1*.
fn1: tens, -"1 ", field, fn2*.
fn2: tens, -"2 ", field, fn3*.
fn3: tens, -"3 ", field, fn4*.
fn4: tens, -"4 ", field, fn5*.
fn5: tens, -"5 ", field, fn6*.
fn6: tens, -"6 ", field, fn7*.
fn7: tens, -"7 ", field, fn8*.
fn8: tens, -"8 ", field, fn9*.
fn9: tens, -"9 ", field, fn0*.

-field: name, (text; link)?, -#a;
        id, -" ", name, -#a.
@id: -"@", name, -"@".
@text: -" ", (~["@"; #a], ~[#a]*)?.
@link: -" ", -"@", name, -"@".
@name: ["A"-"Z"; "0"-"9"]+.
-tens: -['1' - '9'], -['0' - '9']*.

This is the same as Steven's grammar except for the change to the
definition of f9 and the addition of the rules for fn0 through fn9 and
the rule for tens.  This will accept any legitimate sequence of level
numbers, and will also accept a sequence of level numbers in which some
digit other than the last digit is wrong.  So it will allow a line at
level 23 to be followed by a line at level 94.

If the goal is to validate an input stream and determine whether it
conforms to the Gedcom grammar or not, then failing to detect such
errors is a flaw; if the goal is to parse Gedcom data and provide an
accurate XML representation of it, then it may be a matter of
indifference what the parser does with other input.  If there is no
requirement to raise an error on input that is not syntactically correct
Gedcom data, then the failure to detect the 23/94 error is not a flaw.

On the other hand, this second grammar also fails to provide the correct
nesting if a dataum at level 23 is immediately followed by a datum at
level 13.  That's probably a more serious flaw.  If I deployed this
grammar in a real project I would retain the level numbers so I could
check them downstream and correct the structure.  

A third grammar would move the entire problem of structure downstream
and say something like

record: line+.
line: @level-number, -" ", -field.
level-number: [Nd].
...

But this grammar just gives up on the level-number challenge, so I don't
like it very much.

I'd rather have a grammar that correctly handled all the Gedcom data I
was likely to see, even if it failed to flag ungrammatical input if fed
some forms of non-Gedcom data.

And in fact I think that is likely to be possible (at least, as far as
level numbers are concerned).  As Steven said, the Gedcom grammar is not
a context-free language.  He meant, I think, that it requires more than
context-free power, but as far as level numbers are concerned the reason
it's not context-free in the usual sense is that it's a regular
language.  The spec says

    Level numbers must be between 0 and 99 and must not contain leading
    zeroes, for example, level one must be 1, not 01.

So a variant of Steven's grammar with rules for f10 through f99 will
correctly parse all legal sequences of Gedcom level numbers. Norm is, as
far as I can tell, right that this approach cannot handle arbitrarily
deep nesting.  But when a data format carefully written in such a way
that from the comments in grammar it is possible to learn how big each
buffer needs to be in a program to process data in the format, then it
seems likely that there will also be a fixed maximum size to the buffer
for level numbers.

Michael
    

-- 
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
http://blackmesatech.com
Received on Wednesday, 3 August 2022 23:50:11 UTC