Re: *which* alternative that matches nothing? (was Re: repetition)

Dave Pawson <dave.pawson@gmail.com> writes:
> I’m suggesting my preference for a single representation of an empty
> set, from the selection that Steven offered- on grounds of simplicity.
> However you name it, null, empty set or empty sequence ( end user view
> cf implementers view)

I understand the motivation to have a single representation with a
single name, but unfortunately, there are two different things here and
they aren’t the same, so they can’t have the same name.

Let’s look at some examples. I’m going to leave out the quotation marks
because I don’t think that level of detail is necessary.

first: a, b, c .

That matches an “a” followed by a “b”, followed by a “c”. Let’s think
about how a parser might match this rule. (This is a sloppy, conceptual
description of a parser, not a technical one.)

The parser wonders, am I looking at a “first”? It hasn’t started
matching anything yet, so we can mark its position in the rule with a
“^”:

  first: ^a, b, c .

If the current input token is an “a”, the parser says “yup, that
matches”, and advances its position:

  first: a, ^b, c .

Now it gets the next token. If it gets a “b”, that’ll match and it can
continue. If the next token isn’t a “b”, then it can’t be looking at a
“first”. If it gets a “c” after the “b”, then it can say it was looking
at a “first”.

Now what about this rule?

  second: ^a, (), b, c .

If the current input token is an “a”, the parser says “yup, that
matches”, and advances its position:

  second: a, ^(), b, c .

Now it’s looking at a form that matches an empty string. It can say,
“yup, that matches,” and advance its position *without* getting the next
token.

  second: a, (), ^b, c .

Now it gets the next token. If it gets a “b”, that’ll match and it can
continue. If the next token isn’t a “b”, then it can’t be looking at a
“second”.

What about this rule:

  third: ^a, [aeiou], b, c .

If the current input token is an “a”, the parser says “yup, that
matches”, and advances its position:

  third: a, ^[aeiou], b, c .

Now it gets the next token. If it gets one of the common English
language vowels, that’ll match and it can continue. If the next token
is an “x”, then it can’t be looking at a “third”.

What about this rule:

  fourth: ^a, [], b, c .

If the current input token is an “a”, the parser says “yup, that
matches”, and advances its position:

  fourth: a, ^[], b, c .

Now it gets the next token. If the next token is a character that is a
member of the set that contains no characters, that’ll match. That’s
also a logical impossibility. That will *never* match.

For completeness:

  fifth: ^a, []?, b, c .

One way to look at that is to say that it encodes two different, equally
acceptable rules for “fifth”:

  fifth: ^a, [], b, c .
  fifth: ^a, (), b, c .

(If something is optional then logically it can be replaced by a thing
that matches nothing.)

The parser can match “fifth” if *either* rule succeeds. (We don’t allow
users to have multiple rules for the same non terminal, but we can
imagine what it might mean if we did.)

The parser advances past “a”, advances past () as before, and goes
merrily along matching “fifth” if it finds a “b” and a “c” next.

Those two different things can’t have the same name.

You might say that it’s stupid to allow [] since it can’t match anything
and []? because it can only match (), to say nothing of ()? or []* or
()+. But they’re a natual consequence of generality in the language and
they are genuinely useful sometimes, even if those times are somewhat
uncommon.

Hoping that’s helpful.

                                        Be seeing you,
                                          norm

--
Norm Tovey-Walsh
Saxonica

Received on Saturday, 18 December 2021 12:30:13 UTC