- From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
- Date: Wed, 03 Aug 2022 16:54:31 -0600
- To: Norm Tovey-Walsh <norm@saxonica.com>
- Cc: Steven Pemberton <steven.pemberton@cwi.nl>, public-ixml@w3.org
Norm Tovey-Walsh <norm@saxonica.com> writes: > [[PGP Signed Part:Undecided]] > Steven Pemberton <steven.pemberton@cwi.nl> writes: >> Hopefully I'm more awake now, and have got it right. The Gedcom >> version, with links, with nesting. > > I think it’s important to observe that this doesn’t solve the problem as > presented, it solves a subset of the problem. It can’t handle nesting > more than 9 nevels and I don’t think ixml can handle arbitrarily deep > nesting. (Though as Steve pointed out, you could scan for the level and > then generate a grammar that had that many levels.) I may have understood the presentation of the problem differently; I thought the question was along the lines of "can you do anything useful with this?" Clearly writing an ixml grammar that accepts all and only valid Gedcom data counts as doing something useful. But I think other things also count as useful, and I think Steven's grammar shows one helpful approach. It accepts all and only files where the level numbers follow the rules (a datum with the level number L+1 is subordinate to and logically enclosed by the most recent datum with level number L, and the level number must increase by at most 1) and the level numbers are all between 0 and 9 inclusive. (Like Steven's sample solution, I will ignore here the Gedcom rules for continuation lines, escapes, character-set shifts and other complications in order to focus on the questions raised by the level numbers.) So I think Steven's solution counts as a successful response to Mike's challenge. It does, after all, support nesting more than twice as deep as is found in the sample input. It seems likely to handle a useful range of Gedcom input. A second useful grammar one could write would accept all Gedcom data with correct level numbers, and reject some but not all Gedcom data with ungrammatical level numbers. gedcom: record*. record: -"0 ", -field, f1*. f1: -"1 ", field, f2*. f2: -"2 ", field, f3*. f3: -"3 ", field, f4*. f4: -"4 ", field, f5*. f5: -"5 ", field, f6*. f6: -"6 ", field, f7*. f7: -"7 ", field, f8*. f8: -"8 ", field, f9*. f9: -"9 ", field, fn0*. fn0: tens, -"0 ", field, fn1*. fn1: tens, -"1 ", field, fn2*. fn2: tens, -"2 ", field, fn3*. fn3: tens, -"3 ", field, fn4*. fn4: tens, -"4 ", field, fn5*. fn5: tens, -"5 ", field, fn6*. fn6: tens, -"6 ", field, fn7*. fn7: tens, -"7 ", field, fn8*. fn8: tens, -"8 ", field, fn9*. fn9: tens, -"9 ", field, fn0*. -field: name, (text; link)?, -#a; id, -" ", name, -#a. @id: -"@", name, -"@". @text: -" ", (~["@"; #a], ~[#a]*)?. @link: -" ", -"@", name, -"@". @name: ["A"-"Z"; "0"-"9"]+. -tens: -['1' - '9'], -['0' - '9']*. This is the same as Steven's grammar except for the change to the definition of f9 and the addition of the rules for fn0 through fn9 and the rule for tens. This will accept any legitimate sequence of level numbers, and will also accept a sequence of level numbers in which some digit other than the last digit is wrong. So it will allow a line at level 23 to be followed by a line at level 94. If the goal is to validate an input stream and determine whether it conforms to the Gedcom grammar or not, then failing to detect such errors is a flaw; if the goal is to parse Gedcom data and provide an accurate XML representation of it, then it may be a matter of indifference what the parser does with other input. If there is no requirement to raise an error on input that is not syntactically correct Gedcom data, then the failure to detect the 23/94 error is not a flaw. On the other hand, this second grammar also fails to provide the correct nesting if a dataum at level 23 is immediately followed by a datum at level 13. That's probably a more serious flaw. If I deployed this grammar in a real project I would retain the level numbers so I could check them downstream and correct the structure. A third grammar would move the entire problem of structure downstream and say something like record: line+. line: @level-number, -" ", -field. level-number: [Nd]. ... But this grammar just gives up on the level-number challenge, so I don't like it very much. I'd rather have a grammar that correctly handled all the Gedcom data I was likely to see, even if it failed to flag ungrammatical input if fed some forms of non-Gedcom data. And in fact I think that is likely to be possible (at least, as far as level numbers are concerned). As Steven said, the Gedcom grammar is not a context-free language. He meant, I think, that it requires more than context-free power, but as far as level numbers are concerned the reason it's not context-free in the usual sense is that it's a regular language. The spec says Level numbers must be between 0 and 99 and must not contain leading zeroes, for example, level one must be 1, not 01. So a variant of Steven's grammar with rules for f10 through f99 will correctly parse all legal sequences of Gedcom level numbers. Norm is, as far as I can tell, right that this approach cannot handle arbitrarily deep nesting. But when a data format carefully written in such a way that from the comments in grammar it is possible to learn how big each buffer needs to be in a program to process data in the format, then it seems likely that there will also be a fixed maximum size to the buffer for level numbers. Michael -- C. M. Sperberg-McQueen Black Mesa Technologies LLC http://blackmesatech.com
Received on Wednesday, 3 August 2022 23:50:11 UTC