More thoughts about grammar combinations

Hello,

I’m trying to synthesize the ideas in John’s draft (thank you John!)
with some of the discussion we had yesterday into a (sketch of a)
concrete proposal.

My proposed design constraints:

1. The description of grammar combination has to be at a level of
   abstraction that permits different implementation strategies. (We
   have to say what needs to be done, not how it’s done.)

2. Grammar authors need to be able to say what is “exposed”. (This isn’t
   strictly true: we could allow anything to be imported and say that if
   the author of the imported grammar changes it and your importing
   grammar breaks, that’s your lookout. But we can do better than that.)

3. Importing a symbol X must import all of the nonterminals that X
   depends on, but must not expose them.

   In other words, given the grammar:
 
   X = A, B. A = "a". B = "b".

   If I import (only) X, I should be able to use X in my grammar, it
   should match “ab”, it should produce the result
   <X><A>a</A><B>b</B></X>, but it must be an error for me to attempt
   to use A or B in my grammar.

4. A corollary of point 3 is that my grammar should be able to define
   a nonterminal A and use it without knowing that importing X has
   had the effect of importing some other nonterminal named A in the
   imported grammar.

5. When I import a nonterminal, I must be able to change its name. Not
   only does this allow me to avoid conflicts in my grammar, it is
   necessary to avoid conflicts if importing multiple grammars.
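
Constraints 3 and 4 can be read as a reachability computation over the
imported grammar. The following is only a rough sketch in Python, not
ixml. It assumes a grammar is represented as a mapping from each
nonterminal to the set of nonterminals its rule refers to; that
representation and the name split_import are made up for illustration.

```python
def split_import(rules, requested):
    """Return (exposed, hidden) nonterminal sets for an import.

    rules: dict mapping nonterminal -> set of nonterminals it references
           (a hypothetical representation, not part of any ixml API).
    requested: the names listed in the import.
    """
    exposed = set(requested)
    seen = set()
    stack = list(requested)
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        # Names with no rule simply contribute no further references.
        stack.extend(rules.get(name, ()))
    # Everything reachable is brought in, but only the requested
    # names are usable by the importing grammar.
    return exposed, seen - exposed


grammar = {"X": {"A", "B"}, "A": set(), "B": set()}
exposed, hidden = split_import(grammar, ["X"])
```

Here exposed holds only X, while A and B land in hidden: they come along
with X but the importing grammar cannot refer to them, and is free to
define its own A.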

I think those constraints can be satisfied and produce a coherent model
for grammar combination. Let’s imagine some syntax!

In my grammar-to-be-imported, I add an “export” declaration. There are a
bunch of ways we could do this, some more familiar than others depending
on your experience with various programming languages. Java and its
descendants use public/protected/private keywords, XSLT has a
“visibility” property that takes similar values. We could make either of
those work.

My intuition is that putting the declaration inline with the rules is
going to make the grammar harder to read. Let’s initially imagine
declaring this in the prolog.

John introduced the notion of wildcards (because if I have a couple of
dozen nonterminals and I only want to conceal a handful of them as
“implementation details” then listing all of the exported ones
explicitly could be tedious and error prone) but that seems quite
complicated and I don’t see how to make it play nicely with renaming, so
I’m imagining something a little different. How about:

  public X, B.

or

  private Y, Z.

The former says that if this grammar is imported, the nonterminals X and
B may be imported. The latter says any nonterminal except Y or Z can be
imported. It’s an error to have both declarations. If neither is
present, any nonterminal can be imported. Optional: the declaration
“public .” prohibits any nonterminals from being imported. (And I guess
for symmetry “private .” allows them all, explicitly.)
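
The rules for the two declarations could be sketched roughly as follows.
This is Python rather than ixml, and it is only one possible reading:
the function name importable and its arguments are assumptions, with
None standing for "declaration absent" and an empty list standing for
"public ." or "private .".

```python
def importable(defined, public=None, private=None):
    """Names that may be imported under the proposed declarations.

    defined: set of all nonterminals in the grammar.
    public / private: the names from the (hypothetical) prolog
    declaration, or None when that declaration is absent.
    """
    if public is not None and private is not None:
        raise ValueError("a grammar may not declare both public and private")
    if public is not None:
        # Only the listed names; an empty list exposes nothing.
        return set(public)
    if private is not None:
        # Everything except the listed names.
        return set(defined) - set(private)
    # No declaration: any nonterminal can be imported.
    return set(defined)
```

For example, importable({"X", "Y", "Z"}, private=["Y", "Z"]) leaves only
X available to importers.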

In the importing grammar, we need a way to identify the grammar to be
imported, identify the symbols to be imported, and rename the symbols.
How about:

  import "uri-of-imported-grammar.ixml" for X, B as C.

You can have multiple import declarations. The order doesn’t matter, and
it doesn’t matter if you prefer to write:

  import "uri-of-imported-grammar.ixml" for X.
  import "uri-of-imported-grammar.ixml" for B as C.
  import "uri-of-other-grammar.ixml" for B as Q.

But all of the imported names (X, C, and Q above) must be unique.
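
A processor could check the uniqueness requirement along these lines.
This is a Python sketch under assumptions: each import is flattened
into a hypothetical (uri, original name, local name) triple, and any
repeated local name is rejected, which may be stricter than what we
eventually decide.

```python
def local_names(imports):
    """Map each local name to the (uri, original name) it came from.

    imports: list of (uri, original, local) triples, where local is
    the name after any "as" renaming. The triple layout is an
    assumption made for this sketch.
    """
    table = {}
    for uri, original, local in imports:
        if local in table:
            raise ValueError(f"imported name {local!r} is not unique")
        table[local] = (uri, original)
    return table


imports = [
    ("uri-of-imported-grammar.ixml", "X", "X"),
    ("uri-of-imported-grammar.ixml", "B", "C"),
    ("uri-of-other-grammar.ixml", "B", "Q"),
]
names = local_names(imports)
```

Because the result is keyed by local name, the order of the import
declarations makes no difference to the outcome.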

We have to decide if this is allowed, or is an error:

  import "uri-of-imported-grammar.ixml" for B as C.
  import "uri-of-imported-grammar.ixml" for B as D.

I don’t think it adds any functionality. If you wanted B as both C and
D, you could equally use:

  import "uri-of-imported-grammar.ixml" for B as C.
  D = -C.

So it’s a question of which will be least surprising to users, I think.

The import statement:

  import "uri-of-a-third-grammar.ixml" .

imports all of the public symbols from the grammar without renaming any
of them.

I’m assuming we’ll adopt Steven’s renaming proposal and the names in the
import declaration are “naming”s. That means you can say:

  import "uri-of-imported-grammar.ixml" for X, B as C>B.

which means that you’re importing B, calling it C in this grammar, but
saying it serializes as B. I think that’s neat and tidy.

I have one optional feature to propose: for an imported grammar, we
relax the rule that says undefined nonterminals are an error. (If you
attempt to use the imported grammar directly, the error still applies so
I’m not proposing to remove the rule altogether.)

Given “x.ixml”:

   X = A, B.
   A = "a".

(which is incomplete because B is undefined), I can then do this:

  import "x.ixml" for X.

  S = X.
  B = "B".

This allows an author to write a grammar with “slots” where user-defined
behavior can be inserted.  
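
To make the “slot” idea concrete, here is a rough Python sketch of how
the open slots of an imported grammar might be found and checked against
the importing grammar. As before, a grammar is assumed to be a mapping
from each nonterminal to the names its rule refers to, and both function
names are invented for illustration.

```python
def open_slots(rules):
    """Nonterminals that are referenced but have no rule."""
    referenced = set().union(*rules.values())
    return referenced - set(rules)


def unfilled_slots(imported_rules, importing_rules):
    """Slots of the imported grammar that the importer leaves undefined."""
    return open_slots(imported_rules) - set(importing_rules)


x_ixml = {"X": {"A", "B"}, "A": set()}    # B is undefined: a slot
importer = {"S": {"X"}, "B": set()}       # the importer supplies B

slots = open_slots(x_ixml)
remaining = unfilled_slots(x_ixml, importer)
```

With these inputs the only slot is B and nothing remains unfilled. Used
on its own, x.ixml would still have an open slot, which corresponds to
the existing undefined-nonterminal error.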

Looking at the use cases document, I believe this proposal satisfies
INC-BY-REF, TRANSITIVE, AVOID-COLLISION, RENAME-NT,
NT-VISIBILITY-EXPORT, NT-VISIBILITY-ACCEPT.

It doesn’t satisfy OVERRIDE-NT or NT-COMBINE.

I don’t think it would be too hard to add support for these to the
syntax, but I fear it makes the conceptual model that much larger and
more complex. I think you can accomplish NT-COMBINE with existing
features:

Given “y.ixml”:

   Y = A, B. A = "a". B = "b".

I can then do this:

  import "y.ixml" for Y.

  S = Z.
  Z>Y = -Y, "c".

And S matches “abc” but outputs Y.

The OVERRIDE-NT use case seems to stand in conflict with the idea that
the author of the imported grammar has control over what symbols are
exposed. If we allow undefined symbols to be used as slots, we get most
of the same functionality but with the imported grammar author’s
blessing, as it were.

I propose that the smallest useful grammar combination feature doesn’t
need to support OVERRIDE-NT.

                                        Be seeing you,
                                          norm

--
Norm Tovey-Walsh
Saxonica

Received on Wednesday, 6 September 2023 08:08:28 UTC