- From: Norm Tovey-Walsh <norm@saxonica.com>
- Date: Sat, 18 Dec 2021 11:57:49 +0000
- To: Dave Pawson <dave.pawson@gmail.com>
- Cc: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, Steven Pemberton <steven.pemberton@cwi.nl>, public-ixml@w3.org
- Message-ID: <m2a6gy5c5a.fsf@saxonica.com>
Dave Pawson <dave.pawson@gmail.com> writes: > I’m suggesting my preference for a single representation of an empty > set, from the selection that Steven offered- on grounds of simplicity. > However you name it, null, empty set or empty sequence ( end user view > cf implementers view) I understand the motivation to have a single representation with a single name, but unfortunately, there are two different things here and they aren’t the same, so they can’t have the same name. Let’s look at some examples. I’m going to leave out the quotation marks because I don’t think that level of detail is necessary. first: a, b, c . That matches an “a” followed by a “b”, followed by a “c”. Let’s think about how a parser might match this rule. (This is a sloppy, conceptual description of a parser, not a technical one.) The parser wonders, am I looking at a “first”? It hasn’t started matching anything yet, so we can mark its position in the rule with a “^”: first: ^a, b, c . If the current input token is an “a”, the parser says “yup, that matches”, and advances its position: first: a, ^b, c . Now it gets the next token. If it gets a “b”, that’ll match and it can continue. If the next token isn’t a “b”, then it can’t be looking at a “first”. If it gets a “c” after the “b”, then it can say it was looking at a “first”. Now what about this rule? second: ^a, (), b, c . If the current input token is an “a”, the parser says “yup, that matches”, and advances its position: second: a, ^(), b, c . Now it’s looking at a form that matches an empty string. It can say, “yup, that matches,” and advance its position *without* getting the next token. second: a, (), ^b, c . Now it gets the next token. If it gets a “b”, that’ll match and it can continue. If the next token isn’t a “b”, then it can’t be looking at a “second”. What about this rule: third: ^a, [aeiou], b, c . If the current input token is an “a”, the parser says “yup, that matches”, and advances its position: third: a, ^[aeiou], b, c . Now it gets the next token. If it gets one of the common English language vowels, that’ll match and it can continue. If the next token is an “x”, then it can’t be looking at a “third”. What about this rule: fourth: ^a, [], b, c . If the current input token is an “a”, the parser says “yup, that matches”, and advances its position: fourth: a, ^[], b, c . Now it gets the next token. If the next token is a character that is a member of the set that contains no characters, that’ll match. That’s also a logical impossibility. That will *never* match. For completeness: fifth: ^a, []?, b, c . One way to look at that is to say that it encodes two different, equally acceptable rules for “fifth”: fifth: ^a, [], b, c . fifth: ^a, (), b, c . (If something is optional then logically it can be replaced by a thing that matches nothing.) The parser can match “fifth” if *either* rule succeeds. (We don’t allow users to have multiple rules for the same non terminal, but we can imagine what it might mean if we did.) The parser advances past “a”, advances past () as before, and goes merrily along matching “fifth” if it finds a “b” and a “c” next. Those two different things can’t have the same name. You might say that it’s stupid to allow [] since it can’t match anything and []? because it can only match (), to say nothing of ()? or []* or ()+. But they’re a natual consequence of generality in the language and they are genuinely useful sometimes, even if those times are somewhat uncommon. Hoping that’s helpful. Be seeing you, norm -- Norm Tovey-Walsh Saxonica
Received on Saturday, 18 December 2021 12:30:13 UTC