CSS, level 2 - query/comment - lex notation from Kent M Pitman on 1998-06-23 (www-style@w3.org from June 1998)

From: Kent M Pitman <kmp@harlequin.com>
Date: Mon, 22 Jun 1998 21:28:13 -0400 (EDT)
To: www-style@w3.org
Cc: kmp@harlequin.com
Message-Id: <9806230132.AA00824@excel.harlequin.com>

In PR-CSS2-19980324, in 4.1.1 Tokenization, I complained the other
evening about a few things I noticed.  Here are some more.  I don't
know if this is a standard "lex" thing, but the following certainly
annoys me both as a choice of presentational style and because of its
implications:

In the definition of the 'escape' macro sequence you have

 unicode ::= \\[0-9a-f]{1,6}[ \n\r\t\f]?

 escape ::= {unicode}|\\[ -~\200-\4177777]
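
A minimal sketch of the two definitions in Python regex syntax, for checking
which alternative a given escape matches and what it would denote.  It assumes
the hex class reads [0-9a-f]{1,6} and the single-character class reads
[ -~\200-\4177777] (octal 4177777 is code point 0x10FFFF), that matching is
case-sensitive, and that the unicode alternative is tried first.  The names
UNICODE, ESCAPE_CHAR, alternatives, and decode are illustrative only and come
from neither the spec nor lex.

```python
import re

# Assumed transcriptions of the two lex macros (see the assumptions above).
UNICODE = re.compile(r"\\[0-9a-f]{1,6}[ \n\r\t\f]?")
ESCAPE_CHAR = re.compile(r"\\[ -~\x80-\U0010FFFF]")

WHITESPACE = " \n\r\t\f"


def alternatives(text):
    """Return the names of the alternatives that match all of `text`."""
    names = []
    if UNICODE.fullmatch(text):
        names.append("unicode")
    if ESCAPE_CHAR.fullmatch(text):
        names.append("escape-char")
    return names


def decode(text):
    """Decode a single escape, trying the unicode alternative first."""
    if UNICODE.fullmatch(text):
        # Drop the backslash and the optional trailing whitespace, then
        # read what is left as a hexadecimal code point.
        return chr(int(text[1:].rstrip(WHITESPACE), 16))
    if ESCAPE_CHAR.fullmatch(text):
        # Backslash simply quotes the following character.
        return text[1]
    raise ValueError("not an escape: %r" % text)


for letter in "abcdefghA":
    esc = "\\" + letter
    print(esc, alternatives(esc), repr(decode(esc)))
```

Under these assumptions \a through \f match both alternatives and decode to
codes 10 through 15, while \g, \h, and \A match only the single-character
alternative and decode to the letter itself.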

Some implications of this are:

 * \a matches BOTH the unicode syntax and the escape syntax.

   I would feel more comfortable if the range of 0-9 and a-f
   were exempted from the definition of escape, so that I'd know
   for sure what was supposed to match.  Maybe lex just finds the
   first of the matches and doesn't tell you that there were ambiguities,
   but since I don't use lex, I find the ambiguity annoying because there
   is no discussion anywhere about how the 'first match' has preference.

 * \A ONLY matches escape, so

      \A = the letter A

   but

      \a = the character whose hex code is A (decimal 10), which is Newline

 * \f is the high end of codes that are used up by hex, so

      \a = Newline  (code 10)
      \b = Ascii VT (code 11)
      \c = Formfeed (code 12)
      \d = Return   (code 13)
      \e = Ascii SO (code 14)
      \f = Ascii SI (code 15)
      \g = Lowercase g
      \h = Lowercase h

   Personally, I think this is extraordinarily weird and ugly.
   If you're not going to give \ the simple meaning of "quote the
   next char", and are instead going to punch a hole in the space,
   then do yourself a favor and make the space "sparse" to catch
   syntax errors.  Don't define a convoluted space that has
   weird non-linear transitions between unrelated theories about how
   to do things.

 * \n, \t, and \r don't have their usual control-char meanings.
   Instead they mean n, t, and r.  I can't believe anyone who writes
   \n, \t, and \r will EVER mean this since those people will be writing
   n, t, or r.  These will almost surely be bugs but you'll never be able
   to catch them.

 * In the descriptions of string1 and string2, there's an ambiguity again
   but in this case the FIRST set matches the wrong thing.  This looks more
   severe.  Does your lex gizmo really have the smarts to get all of these
   things right without special ESP hardware?  In both string1 and string2,
   you show the string contents as: ([\t !#$%&(-~]|...|{escape})*
   but the odd thing is that backslash is in the range [(-~] so I would
   think you'd never get to it.  If you implement (as I did) exactly what's
   there, \ gets sucked up as an ordinary character and is never around in
   order to match escape.  Or is it that macro productions have some priority
   over non-macro productions (which would seem to violate all decent sense
   of referential transparency)?  The same problem happens for the \\\n
   thing, which can't be reached either because backslash has already been
   matched.

 * In the description of \n, you further complicate the above-described
   problem with what \<char> means, because \a already means "the char
   whose code is 10", which is presumably what \n is also going to mean
   here.  So you can either say "xxx\axxx" or "xxx\nxxx" and both
   mean the same thing.  But you don't want to say "xxx\fxxx" if you want
   formfeed because "xxx\cxxx" is how you say that.  And you definitely 
   don't want to say "xxx\txxx" to get tab because you have to use a
   literal tab.  (Incidentally, you nowhere explain what \\\n MEANS--you
   only say that \n is allowed syntactically.  You don't say if it maps to
   10 or 13 or both or a system-specific set.  I suspect this to be a bit of
   C-chauvinism creeping in--just assuming that it's well-defined, or that it's
   ok that it's not.)  What a hodgepodge.
   PLEASE PLEASE PLEASE if you're going to freeze the syntax for all time,
   please make it a good and consistent and predictable one.

   My recommendation is that you not allow backslash
   before single chars other than things from \177 upward, and that you 
   reserve the entire ASCII space as unusable after backslash except
   0-9 a-f A-F and the special characters t and n.  And that you do it
   CONSISTENTLY.  The idea that \n is allowed in strings but NOT in idents
   seems silly when the newline character is permitted in idents if you use
   a different syntax (\a), for example.  So one must say xxx\axxx
   to get newline into an ident, but can say "xxx\nxxx" to get it into a 
   string.  Of course, \n and \a aren't really synonyms since
   "foo\nbar" and "foo\abar" mean radically different things. (sigh)
   And "foo\n bar" and "foo\a bar" also mean different things. (sigh again)

 * I don't see any size limitations on the tokens that are identifiers or
   strings.  Is that right?  I can live with this, but it's messy.  If I find
   out that other people are secretly getting away with bounds and that they're
   not telling me I could be doing likewise, I'll be a bit bummed out, though.
   For example, is there a deep dark secret truth about Lex's max buffer size
   that I'm not being told about?
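
Two of the points above can be checked mechanically: the shadowing of
backslash by the range (-~ in the string contents, and the different readings
of \n and \a inside a string.  The following is a rough sketch of a
hand-written matcher that tries alternatives in the order given; it does not
claim to show what lex itself does.  It assumes the escape definition reads
\\[0-9a-f]{1,6}[ \n\r\t\f]? or \\[ -~\200-\4177777], and it treats \n as
backslash followed by the letter n.  The names classify and decode_body are
illustrative only.

```python
import re

PLAIN = r"[\t !#$%&(-~]"                    # string contents, as quoted
UNICODE = r"\\[0-9a-f]{1,6}[ \n\r\t\f]?"    # assumed reading of 'unicode'
ESCAPE_CHAR = r"\\[ -~\x80-\U0010FFFF]"     # assumed single-char escape
ESCAPE = UNICODE + "|" + ESCAPE_CHAR


def classify(body, escape_first):
    """Split a string body into (kind, text) pieces, trying the
    alternatives in the requested order at each position."""
    plain = "(?P<plain>" + PLAIN + ")"
    escape = "(?P<escape>" + ESCAPE + ")"
    pattern = escape + "|" + plain if escape_first else plain + "|" + escape
    return [(m.lastgroup, m.group()) for m in re.finditer(pattern, body)]


def decode_body(body):
    """Decode a string body with escapes tried before plain characters."""
    out = []
    for kind, text in classify(body, escape_first=True):
        if kind == "plain":
            out.append(text)
        elif re.fullmatch(UNICODE, text):
            # Hex escape: strip backslash and optional trailing whitespace.
            out.append(chr(int(text[1:].rstrip(" \n\r\t\f"), 16)))
        else:
            out.append(text[1])
    return "".join(out)


# Backslash (code 92) lies between "(" (40) and "~" (126), so the plain
# class accepts it before the escape alternative is ever consulted.
print(classify("x\\ay", escape_first=False))
print(classify("x\\ay", escape_first=True))
for body in ["foo\\nbar", "foo\\abar", "foo\\n bar", "foo\\a bar"]:
    print(repr(body), "->", repr(decode_body(body)))
```

With the plain class tried first, the backslash is consumed as an ordinary
character and no escape is ever recognized.  With escapes tried first,
"foo\abar" reads \aba as one hex escape followed by r, and "foo\a bar" drops
the space after the escape, while the \n forms simply yield the letter n.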

Also here are a few comments about the parsing of numbers:

 * Is there REALLY no limit on the number of digits before and after the
   decimal point in a num?  I don't really mind there not being.  I'm working
   in Lisp and we have arbitrary-precision arithmetic, but geez--I thought
   this was a major headache for those guys with languages like C and Java
   that don't boast arbitrary-precision arithmetic.

 * The production for num does not generate negative numbers.  There's no
   provision for a leading hyphen even though it would not lead to any 
   ambiguity for there to be a leading hyphen.  There are some examples of
   stuff (e.g., in 9.8.4 Absolute Positioning) where negative numbers appear
   to be used such as -100px which looks to me like a DIMENSION which is 
   defined as {num}{ident} in 4.1.1, and which therefore doesn't look valid.
   IF I AM CORRECTLY UNDERSTANDING THE SITUATION, THIS ERROR LOOKS SEVERE.
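
The DIMENSION point can be illustrated with a small Python sketch.  It assumes
num is an unsigned run of digits with an optional fractional part, as
described above, and it simplifies ident to an ASCII letter followed by
letters, digits, or hyphens (the real definition also admits non-ASCII
characters and escapes).  SIGNED_DIMENSION is a hypothetical variant with an
optional leading hyphen, shown only for comparison; it is not taken from the
spec.

```python
import re

NUM = r"(?:[0-9]+|[0-9]*\.[0-9]+)"      # unsigned number, no leading hyphen
IDENT = r"[a-zA-Z][a-zA-Z0-9-]*"        # simplified ident (assumption)

DIMENSION = re.compile(NUM + IDENT)     # {num}{ident}
SIGNED_DIMENSION = re.compile("-?" + NUM + IDENT)   # hypothetical variant


def is_dimension(text, allow_sign=False):
    """Report whether all of `text` matches the chosen DIMENSION pattern."""
    pattern = SIGNED_DIMENSION if allow_sign else DIMENSION
    return pattern.fullmatch(text) is not None


for sample in ["100px", ".5em", "-100px"]:
    print(sample, is_dimension(sample), is_dimension(sample, allow_sign=True))
```

Under these assumptions 100px and .5em match {num}{ident}, while -100px
matches only the hypothetical signed variant.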
Received on Tuesday, 23 June 1998 04:45:52 UTC