- From: Kent M Pitman <kmp@harlequin.com>
- Date: Mon, 22 Jun 1998 21:28:13 -0400 (EDT)
- To: www-style@w3.org
- Cc: kmp@harlequin.com
In PR-CSS2-19980324, in 4.1.1 Tokenization, I complained the other
evening about a few things I noticed. Here are some more.

I don't know if this is a standard "lex" thing, but the following
certainly annoys me both as a choice of presentational style and
because of its implications. In the definition of the 'escape' macro
sequence you have

  unicode ::= \\[0-9a-f]{1,6}[ \n\r\t\f]?
  escape  ::= {unicode}|\\[ -~\200-\4177777]

Some implications of this are:

* \a matches BOTH the unicode syntax and the escape syntax. I would
  feel more comfortable if the range of 0-9 and a-f were exempted from
  the definition of escape, so that I'd know for sure what was supposed
  to match. Maybe lex just finds the first of the matches and doesn't
  tell you that there were ambiguities, but since I don't use lex, I
  find the ambiguity annoying because there is no discussion anywhere
  about how the 'first match' has preference.

* \A ONLY matches escape, so \A = the letter A but \a = the character
  whose code is hex A (decimal 10), which is Newline.

* \f is the high end of codes that are used up by hex, so

    \a = Newline (code 10)
    \b = Ascii VT (code 11)
    \c = Formfeed (code 12)
    \d = Return (code 13)
    \e = Ascii SO (code 14)
    \f = Ascii SI (code 15)
    \g = Lowercase g
    \h = Lowercase h

  Personally, I think this is extraordinarily weird and ugly. If you're
  not going to give \ the simple meaning of "quote the next char" and
  are instead going to punch a hole in the space, then do yourself a
  favor and make the space "sparse" to catch syntax errors. Don't
  define a convoluted space that has weird non-linear transitions
  between unrelated theories about how to do things.

* \n, \t, and \r don't have their usual control-char meanings. Instead
  they mean n, t, and r. I can't believe anyone who writes \n, \t, and
  \r will EVER mean this, since those people would just write n, t, or
  r. These will almost surely be bugs but you'll never be able to catch
  them.
* In the descriptions of string1 and string2, there's an ambiguity
  again but in this case the FIRST set matches the wrong thing. This
  looks more severe. Does your lex gizmo really have the smarts to get
  all of these things right without special ESP hardware? In both
  string1 and string2, you show the string contents as:

    ([\t !#$%&(-~]|...|{escape})*

  but the odd thing is that backslash is in the range [(-~] so I would
  think you'd never get to it. If you implement (as I did) exactly
  what's there, \ gets sucked up as an ordinary character and is never
  around in order to match escape. Or is it that macro productions have
  some priority over non-macro productions (which would seem to violate
  all decent sense of referential transparency)? The same problem
  happens for the \\\n thing, which can't be reached either because
  backslash has already been matched.

* In the description of \n, you further complicate the above-described
  problem with what \<char> means because already \a means "the char
  whose code is 10" so that already means what \n is presumably also
  going to mean here. So you can either say "xxx\axxx" or "xxx\nxxx"
  and both mean the same thing. But you don't want to say "xxx\fxxx" if
  you want formfeed because "xxx\cxxx" is how you say that. And you
  definitely don't want to say "xxx\txxx" to get tab because you have
  to use a literal tab.

  (Incidentally, you nowhere explain what \\\n MEANS--you only say that
  \n is allowed syntactically. You don't say if it maps to 10 or 13 or
  both or a system-specific set. I suspect this to be a bit of
  C-chauvinism creeping in--just assuming that it's well-defined, or
  that it's ok that it's not.)

What a hodgepodge. PLEASE PLEASE PLEASE if you're going to freeze the
syntax for all time, please make it a good and consistent and
predictable one.
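
The escape and string points above can be checked mechanically. What follows is only a sketch in Python, not anything from the spec: the two macros are transcribed into Python's `re` syntax (with the octal bounds written as U+0080 and U+10FFFF), the `...` in the string production is left out, and the helper names `matching_rules`, `decode`, and `split_string_body` are made up for illustration. Python tries alternatives left to right, which happens to mirror the literal "implement exactly what's there" reading; a lex-generated scanner may well resolve things differently.

```python
import re

# Transcribed from the two macros quoted earlier in this message.
# \200 and \4177777 (octal) are written as U+0080 and U+10FFFF.
UNICODE = re.compile(r"\\[0-9a-f]{1,6}[ \n\r\t\f]?")
ESCAPE_CHAR = re.compile(r"\\[ -~\u0080-\U0010ffff]")

# String contents as shown for string1/string2, minus the elided "..." part.
STRING_ITEM = re.compile(
    r"[\t !#$%&(-~]|" + UNICODE.pattern + "|" + ESCAPE_CHAR.pattern
)


def matching_rules(text):
    """Name every alternative of 'escape' that matches all of `text`."""
    names = []
    if UNICODE.fullmatch(text):
        names.append("unicode")
    if ESCAPE_CHAR.fullmatch(text):
        names.append("escape-char")
    return names


def decode(text):
    """Decode one escape, ASSUMING the unicode alternative wins a tie."""
    if UNICODE.fullmatch(text):
        hex_digits = text[1:].rstrip(" \n\r\t\f")
        return chr(int(hex_digits, 16))
    if ESCAPE_CHAR.fullmatch(text):
        return text[1]
    raise ValueError("not an escape: %r" % text)


def split_string_body(body):
    """Split string contents into items, trying alternatives in order."""
    return [m.group(0) for m in STRING_ITEM.finditer(body)]


if __name__ == "__main__":
    for esc in ["\\a", "\\A", "\\f", "\\g", "\\n"]:
        print(repr(esc), matching_rules(esc), repr(decode(esc)))
    # The backslash is consumed by the plain-character class, so the
    # escape alternative is never reached.
    print(split_string_body("x\\ay"))
```

Under these assumptions, `\a` is reported as matching both alternatives while `\A` and `\g` match only one, and the string body `x\ay` comes back as four separate one-character items because the backslash falls inside the range `(-~`.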
My recommendation is that you not allow backslash before single chars
other than things from \177 upward, and that you reserve the entire
ASCII space as unusable after backslash except 0-9 a-f A-F and the
special characters t and n. And that you do it CONSISTENTLY. The idea
that \n is allowed in strings but NOT in idents seems silly when the
newline character is permitted in idents if you use a different syntax
(\a), for example. So one must say xxx\axxx to get newline into an
ident, but can say "xxx\nxxx" to get it into a string. Of course, \n
and \a aren't really synonyms since "foo\nbar" and "foo\abar" mean
radically different things. (sigh) And "foo\n bar" and "foo\a bar" also
mean different things. (sigh again)

* I don't see any size limitations on the tokens that are identifiers
  or strings, is that right? I can live with this, but it's messy. If I
  find out that other people are secretly getting away with bounds and
  that they're not telling me I could be doing likewise, I'll be a bit
  bummed out, though. For example, is there a deep dark secret truth
  about Lex's max buffer size that I'm not being told about?

Also here are a few comments about the parsing of numbers:

* Is there REALLY no limit on the number of digits before and after the
  decimal point in a num? I don't really mind there not being. I'm
  working in Lisp and we have arbitrary-precision arithmetic, but
  geez--I thought this was a major headache for those guys with
  languages like C and Java that don't boast arbitrary-precision
  arithmetic.

* The production for num does not generate negative numbers. There's no
  provision for a leading hyphen even though it would not lead to any
  ambiguity for there to be a leading hyphen. There are some examples
  of stuff (e.g., in 9.8.4 Absolute Positioning) where negative numbers
  appear to be used, such as -100px, which looks to me like a DIMENSION,
  which is defined as {num}{ident} in 4.1.1, and which therefore doesn't
  look valid.
IF I AM CORRECTLY UNDERSTANDING THE SITUATION, THIS ERROR LOOKS SEVERE.
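
The negative-number point can be illustrated with a short Python sketch. It rests on assumptions that are not quoted in this message: that num is `[0-9]+|[0-9]*\.[0-9]+`, and that ident can be approximated by an ASCII-only pattern starting with a letter (the real ident also allows non-ASCII characters and escapes). The name `is_dimension` is hypothetical.

```python
import re

# ASSUMED shape of the num macro; the message above does not quote it.
NUM = r"(?:[0-9]*\.[0-9]+|[0-9]+)"
# Simplified, ASCII-only stand-in for ident (no nonascii, no escapes).
IDENT = r"[a-zA-Z][a-zA-Z0-9-]*"
# DIMENSION is {num}{ident}, as stated for 4.1.1 above.
DIMENSION = re.compile(NUM + IDENT)


def is_dimension(text):
    """True if the whole of `text` is a single DIMENSION token."""
    return DIMENSION.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["100px", "1.5em", "-100px"]:
        print(sample, is_dimension(sample))
```

With those assumptions, `100px` and `1.5em` are accepted but `-100px` is not, since neither num nor the start of ident has anywhere to put the hyphen.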
Received on Tuesday, 23 June 1998 04:45:52 UTC