Re: Bug in grammar for paths from Boris Dalstein on 2019-11-17 (www-svg@w3.org from November 2019)

From: Boris Dalstein <dalboris@gmail.com>
Date: Sun, 17 Nov 2019 18:16:05 +0100
To: www-svg@w3.org
Message-ID: <39580370-b8ec-fbce-c3b8-6dd15122d4e2@gmail.com>
And by the way, I'd also propose improving the readability of the grammar.

Just by renaming the grammar identifiers of the SVG 1.1 spec, we have:

     unsigned: int | float
     number:   (sign? int) | (sign? float)
     int:      digits
     float:    (frac exp?) | (digits exp)
     frac:     (digits? "." digits) | (digits ".")
     exp:      ("e" | "E") sign? digits
     sign:     "+" | "-"
     digits:   digit | digit digits
     digit:    "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

This could be simplified to the following syntax, which is equivalent
but much more concise and readable:

     number:   sign? unsigned
     unsigned: ((digits "."?) | (digits? "." digits)) exp?
     exp:      ("e" | "E") sign? digits
     sign:     "+" | "-"
     digits:   digit+
     digit:    "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

From this, we can more easily derive the following regex, which
we could also add to the spec:

[+\-]?(([0-9]+\.?)|([0-9]*\.[0-9]+))([eE][+\-]?[0-9]+)?
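
For anyone who wants to try a regex of this shape, here is a minimal Python
sketch. It writes the exponent sign as optional, as the `sign?` in the "exp"
rule requires. The names NUMBER_RE and is_number, and the sample strings, are
only illustrative and are not taken from the spec.

```python
import re

# Regex form of the simplified grammar. The exponent sign is optional
# ([+\-]?), matching the rule  exp: ("e" | "E") sign? digits.
NUMBER_RE = re.compile(
    r"[+\-]?(([0-9]+\.?)|([0-9]*\.[0-9]+))([eE][+\-]?[0-9]+)?"
)


def is_number(text: str) -> bool:
    """Return True if the whole string is a number under the grammar."""
    return NUMBER_RE.fullmatch(text) is not None


# Hand-picked samples (assumptions for illustration only).
accepted = ["0", "-12", "+3.", ".5", "1.5e3", "2E-7", "1e10"]
rejected = ["", ".", "+", "e5", "1e", "1e+", "1.2.3", "--1"]

for sample in accepted + rejected:
    print(repr(sample), is_number(sample))
```

Note the use of fullmatch: a path-data parser would instead match at the
current position and continue from where the match ends.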

Any thoughts?

Best regards,
Boris

PS: below is the proof of the equivalence of the two grammars.

First, since the grammar notation states that the symbol `+` means
"one or more", instead of:

     digit | digit digits

we can simply write:

     digit+

Also, the following:

     number: (sign? int) | (sign? float)

can be factored into:

     number: sign? (int | float)

So the rules now look like this:

     unsigned: int | float
     number:   sign? (int | float)
     int:      digits
     float:    (frac exp?) | (digits exp)
     frac:     (digits? "." digits) | (digits ".")
     exp:      ("e" | "E") sign? digits
     sign:     "+" | "-"
     digits:   digit+
     digit:    "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

Then, there is the (int | float) repetition, which can be avoided by simply
defining signed numbers in terms of unsigned numbers:

     number:   sign? unsigned
     unsigned: int | float
     int:      digits
     float:    (frac exp?) | (digits exp)
     frac:     (digits? "." digits) | (digits ".")
     exp:      ("e" | "E") sign? digits
     sign:     "+" | "-"
     digits:   digit+
     digit:    "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

Then, since "int" is just an alias for "digits", we can just remove it:

     number:   sign? unsigned
     unsigned: digits | float
     float:    (frac exp?) | (digits exp)
     frac:     (digits? "." digits) | (digits ".")
     exp:      ("e" | "E") sign? digits
     sign:     "+" | "-"
     digits:   digit+
     digit:    "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

Then, there is the "exp" appearing twice. We can be smarter and make it
appear only once, so it's easier to build a regexp. For example, the
following two rules:

     unsigned: digits | float
     float:    (frac exp?) | (digits exp)

can be rewritten more simply as one rule:

     unsigned: digits | (frac exp?) | (digits exp)

from which it becomes clear that this is in fact:

     unsigned: (digits exp?) | (frac exp?)

And even more simply:

     unsigned: (digits | frac) exp?
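
This factoring step is the least obvious one, so here is a rough Python
sketch that checks it by brute force: it translates the rule before and after
the step into regexes and compares them on every string up to a small length.
The names (BEFORE, AFTER, strings), the reduced alphabet "1.e-" and the length
bound are my own assumptions. This is a sanity check, not a proof.

```python
import re
from itertools import product

DIGITS = r"[0-9]+"
# frac: (digits? "." digits) | (digits ".")
FRAC = rf"(?:(?:{DIGITS})?\.{DIGITS}|{DIGITS}\.)"
# exp: ("e" | "E") sign? digits
EXP = rf"(?:[eE][+\-]?{DIGITS})"

# unsigned: digits | (frac exp?) | (digits exp)
BEFORE = re.compile(rf"(?:{DIGITS}|{FRAC}{EXP}?|{DIGITS}{EXP})")
# unsigned: (digits | frac) exp?
AFTER = re.compile(rf"(?:{DIGITS}|{FRAC}){EXP}?")


def strings(alphabet: str, max_len: int):
    """Yield every string over `alphabet` of length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)


# Strings on which the two forms of the rule disagree.
differences = [
    s for s in strings("1.e-", 7)
    if bool(BEFORE.fullmatch(s)) != bool(AFTER.fullmatch(s))
]
print(differences)
```

An empty list means the two forms agree on every string tested.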

So all the rules now become:

     number:   sign? unsigned
     unsigned: (digits | frac) exp?
     frac:     (digits? "." digits) | (digits ".")
     exp:      ("e" | "E") sign? digits
     sign:     "+" | "-"
     digits:   digit+
     digit:    "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

But since frac only appears in one rule, let's just substitute it:

     number:   sign? unsigned
     unsigned: (digits | (digits? "." digits) | (digits ".")) exp?
     exp:      ("e" | "E") sign? digits
     sign:     "+" | "-"
     digits:   digit+
     digit:    "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"

Oh, but now there are more simplifications we can make! If you look at
the following:

     digits | (digits? "." digits) | (digits ".")

You can see that the third OR clause can be integrated into the first,
by simply making the "." optional:

     (digits "."?) | (digits? "." digits)

So here we are: the rather unreadable rules we started with are in fact
equivalent to these much more readable ones:

     number:   sign? unsigned
     unsigned: ((digits "."?) | (digits? "." digits)) exp?
     exp:      ("e" | "E") sign? digits
     sign:     "+" | "-"
     digits:   digit+
     digit:    "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
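
As a complement to the proof, here is a small Python sketch that encodes both
grammars rule by rule (including the recursive "digits" rule of the original)
and compares them on all short strings. The tuple-based expression encoding,
the helper names (ends, accepts, lit, ref, ...), the reduced alphabet and the
length bound are all assumptions of mine. It only tests a finite set of
strings, so it supports the proof rather than replacing it.

```python
from itertools import product

# Expression constructors for a tiny grammar interpreter (hypothetical helpers).
def lit(s): return ("lit", s)
def ref(name): return ("ref", name)
def seq(*parts): return ("seq", *parts)
def alt(*parts): return ("alt", *parts)
def opt(e): return ("opt", e)
def plus(e): return ("plus", e)


def ends(grammar, expr, text, start):
    """Return every position where `expr` can stop when started at `start`."""
    kind = expr[0]
    if kind == "lit":
        return {start + len(expr[1])} if text.startswith(expr[1], start) else set()
    if kind == "ref":
        return ends(grammar, grammar[expr[1]], text, start)
    if kind == "alt":
        return set().union(*(ends(grammar, e, text, start) for e in expr[1:]))
    if kind == "seq":
        positions = {start}
        for e in expr[1:]:
            positions = set().union(*(ends(grammar, e, text, p) for p in positions))
        return positions
    if kind == "opt":
        return {start} | ends(grammar, expr[1], text, start)
    if kind == "plus":  # one or more repetitions
        reached, frontier = set(), {start}
        while frontier:
            frontier = set().union(
                *(ends(grammar, expr[1], text, p) for p in frontier)) - reached
            reached |= frontier
        return reached
    raise ValueError(kind)


SHARED = {
    "exp": seq(alt(lit("e"), lit("E")), opt(ref("sign")), ref("digits")),
    "sign": alt(lit("+"), lit("-")),
    "digit": alt(*(lit(c) for c in "0123456789")),
}
# The SVG 1.1 rules with renamed identifiers.
ORIGINAL = {
    **SHARED,
    "number": alt(seq(opt(ref("sign")), ref("int")),
                  seq(opt(ref("sign")), ref("float"))),
    "int": ref("digits"),
    "float": alt(seq(ref("frac"), opt(ref("exp"))),
                 seq(ref("digits"), ref("exp"))),
    "frac": alt(seq(opt(ref("digits")), lit("."), ref("digits")),
                seq(ref("digits"), lit("."))),
    "digits": alt(ref("digit"), seq(ref("digit"), ref("digits"))),
}
# The simplified rules.
SIMPLIFIED = {
    **SHARED,
    "number": seq(opt(ref("sign")), ref("unsigned")),
    "unsigned": seq(alt(seq(ref("digits"), opt(lit("."))),
                        seq(opt(ref("digits")), lit("."), ref("digits"))),
                    opt(ref("exp"))),
    "digits": plus(ref("digit")),
}


def accepts(grammar, text):
    """True if `number` can consume the whole of `text`."""
    return len(text) in ends(grammar, grammar["number"], text, 0)


ALPHABET = "1.e+-"  # reduced alphabet to keep the search small
MAX_LEN = 6
mismatches = []
for n in range(MAX_LEN + 1):
    for chars in product(ALPHABET, repeat=n):
        s = "".join(chars)
        if accepts(ORIGINAL, s) != accepts(SIMPLIFIED, s):
            mismatches.append(s)
print(mismatches)
```

An empty list means both grammars accept exactly the same strings within the
tested bound. Widening ALPHABET (for example adding "E" or more digits) or
MAX_LEN makes the check stronger but slower.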


On 17/11/2019 17:14, Boris Dalstein wrote:
> Bouncing back on this as I'm currently writing a parser.
>
> The grammar on the SVG 1.1 spec includes fractional numbers and 
> numbers with an exponent part:
>
> https://www.w3.org/TR/SVG11/paths.html#PathDataBNF
>
> number:
>      sign? integer-constant
>      | sign? floating-point-constant
> integer-constant:
>      digit-sequence
> floating-point-constant:
>      fractional-constant exponent?
>      | digit-sequence exponent
> fractional-constant:
>      digit-sequence? "." digit-sequence
>      | digit-sequence "."
> exponent:
>      ( "e" | "E" ) sign? digit-sequence
> sign:
>      "+" | "-"
> digit-sequence:
>      digit
>      | digit digit-sequence
> digit:
>      "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
>
> While the grammar on the latest SVG 2 CR only contains integers:
>
> https://www.w3.org/TR/2018/CR-SVG2-20181004/paths.html#PathDataBNF
>
> number ::= ([0-9])+
>
> This sounds like an important omission.
>
> Best regards,
> Boris
>
> On 29/04/2017 19:26, Jirka Kosek wrote:
>> On 29.4.2017 18:57, Paul LeBeau wrote:
>>> Paths of the form that I presented do exist and are actually common.  I
>>> wasn't around when the grammar was originally written, so I don't know the
>>> reason why it was written the way it was.
>> Seems that grammar is only illustrational because there are other issues
>> with it -- for example grammar accepts only integers not decimal numbers.
>>
>
Received on Sunday, 17 November 2019 17:16:12 UTC