CSS, level 2 queries re tokenization from Kent Pitman on 1998-06-21 (www-style@w3.org from June 1998)

From: Kent Pitman <kmp@harlequin.com>
Date: Sat, 20 Jun 1998 21:10:07 -0400 (EDT)
To: www-style@w3.org
Cc: kmp@harlequin.com
Message-Id: <9806210109.AA25172@romulus.harlequin.com>
In PR-CSS2-19980324, in 4.1.1 Tokenization ...

* It says that nmstart permits upper and lower case alphabetics, but
  it says that nmchar permits ONLY lower case alphabetics.  Is the
  omission of uppercase for nmchar really intentional?
  For example, bullet 2 in 4.1.3 mentions A-Za-z, i.e. both cases,
  in seeming contradiction to the specified grammar.
  IF THIS DESCRIPTION OF NMCHAR IS WRONG, THE ERROR SEEMS SEVERE TO ME.

* Do you really mean to allow unicode to only be specified in
  lowercase a-f?  Personally, I find this gratuitous, since I prefer
  uppercase hex, but it's not fatal.

* Also, am I right that 'escape' means to include Space through Tilde
  by use of the notation [ -~\200-\4177777]?  Surely something more
  perspicuous could be done.  I almost missed the use of hyphen as a
  connective visually, and thought this said "space and hyphen and tilde..."
  This is not strictly a bug, it's just really ugly, and made worse 
  by your choice of font, in which - and ~ are virtually indistinguishable.

* Same comment for string1 and string2, where it took me forever to figure
  out why A-Z are missing.  I personally think using hyphen to string 
  together anything other than conceptually meaningful sequences like a-f,
  a-z, and 0-9 is not really that good.  I guess I can live with sequences
  of codes.  I'd rather see \050-\176 than "(-~".

* I find the apparent use of decimal to describe character 
  codes in running text [... "space" (Unicode code 32), "tab" (9), ...]
  and octal in your tables [... \200-\4177777 ...] and the fact that
  CSS will ultimately expect me to write in hex (the presumed reason for
  specifying that macro token 'unicode' can take on [0-9a-f]) to be
  IMMENSELY confusing.  You're using Hex, Decimal, and Octal on the same
  page with no indication about which is in use where.  Whether this is a
  bug or not is hard to say, but speaking as an editor of language standards
  myself, this is a good way to confuse readers.  You should really fix this,
  probably to use Hex uniformly since probably that aspect of the
  language is fixed and the rest can be adjusted most easily to match.
  (I abhor hex, but would rather see hex used consistently than a mix
  used in a way that makes it hard to know what's in use at any given
  time.  e.g., perhaps like the XML spec, you could use
  [#x80-#x10FFFF] instead of [\200-\4177777].)

* Isn't the sequence " -~\200-\4177777" the SAME as the simpler 
  sequence " -\4177777"?  That is, isn't " -~" the same as "\040-\177"
  and aren't "\177" and "\200" one after another?  I admit it's late at
  night as I write this and it's been a long time since I used octal, but
  it sure looks like something that could be usefully contracted.  It looks
  otherwise like a split sequence instead of basically "any unicode letter
  from space upward to \4177777".

* Are the conventions for these notations like [...] that you're 
  using defined anywhere?  I looked quickly and didn't see them.
  Is the remark about them being "Lex-style" the definition?  What if I don't
  have a copy of Lex?  Could you perhaps offer a pointer to a publicly
  accessible copy of its spec?  Or could you explain the relevant parts so
  I don't need to go in search?  An actual standard makes a good reference,
  but if Lex doesn't have an associated standard, it's probably not fair
  to assume your reader uses it.  This is not a way to write a standard that
  stands on its own through the ages.
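
The octal, decimal, and hex figures in the comments above can be
cross-checked with a short script.  This is only a sketch: it assumes the
backslash numbers in the character classes are octal (as in Lex), and the
helper names octal, show, and skipped_codes are invented here, not taken
from the spec.  Run as written, it suggests that "~" is octal 176 rather
than 177, so the split form " -~\200-\4177777" leaves out exactly one code
(octal 177, DEL) that the contracted form " -\4177777" would admit.

```python
# Cross-check of the character codes quoted in the comments above.
# Assumption: backslash numbers in the spec's character classes are octal,
# as in Lex.  All names below are illustrative, not part of any spec.


def octal(text):
    """Parse the digits of an octal escape such as '200' or '4177777'."""
    return int(text, 8)


def show(code):
    """Render one character code in decimal, octal, and hex."""
    return f"decimal {code}, octal \\{code:o}, hex #x{code:X}"


SPACE, LPAREN, TILDE = ord(" "), ord("("), ord("~")
LOW, HIGH = octal("200"), octal("4177777")


def skipped_codes():
    """Codes after the end of ' -~' and before the start of '\\200-...'."""
    return list(range(TILDE + 1, LOW))


if __name__ == "__main__":
    rows = [("space", SPACE), ("(", LPAREN), ("~", TILDE),
            ("\\200", LOW), ("\\4177777", HIGH)]
    for name, code in rows:
        print(f"{name:>9}: {show(code)}")
    print("skipped between the two ranges:",
          [show(c) for c in skipped_codes()])
```

Printing every bound in all three bases at once is the point of the
exercise: it makes visible which notation each number in the text is using.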

If I'm looking at an obsolete document and there's a later fix, 
please do let me know.
 -kmp
Received on Sunday, 21 June 1998 14:40:48 UTC