XML 1.0 - Ill-defined notational devices from Kent M Pitman on 1998-04-17 (xml-editor@w3.org from April to June 1998)

From: Kent M Pitman <kmp@harlequin.com>
Date: Fri, 17 Apr 98 04:29:57 EDT
To: xml-editor@w3.org
Cc: kmp@harlequin.com
Message-Id: <9804170829.AA01348@excel.harlequin.com>

I reported this one last summer while you were in draft stage, but no
action was taken.  Maybe you were just too busy.  Anyway, my problem
with it all didn't go away.  I have continued concern about and objection
to the present [...] notational devices--both in the way they are defined 
and the way they are used...

(1) The notation 

     [a-zA-Z]

    is too-briefly described in chapter 6 as 'matches any character with a
    value in the range(s) indicated (inclusive).'  I think this needs 
    elaboration.  At the VERY least, it should say 'from a to z and from
    A to Z'.

(2) WORSE, the notation [a-zA-Z0-9_.:] is NOWHERE defined.  Indeed, the
    notation [abc] is not even defined.

(3) [^abc] is only scantily defined; one must infer from context,
    using superhuman skills, that the "^" is part of the "not" notation and
    not one of the characters that are disallowed.  Without more exposition,
    there is no way to discern that [^abc] doesn't mean
          Char - ( '^' | 'a' | 'b' | 'c' )
    since no hyphen-free use of [...] is shown and one might therefore
    assume that when hyphens are not present, an exclusion is applied.

(4) If you assume [abc] is defined as meaning the enclosed characters,
    then how do you know that [#x12-#x14] doesn't mean
        '#' | 'x' | '-' | '1' | '2' | '4' 
    ?  My conclusion is that you can't let this go without saying.
    It may be that people can figure this spec out pragmatically,
    but it is not the case that the spec really DEFINES a notation plainly.
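
To make points (3) and (4) concrete, here is a minimal Python sketch of the
competing readings. The function names are hypothetical and nothing here
comes from the spec itself; it only models the two interpretations described
above.

```python
def read_as_literals(body):
    """Reading A: each character inside the brackets stands for itself."""
    return set(body)


def read_as_range(body):
    """Reading B: '#xLO-#xHI' is an inclusive range of hex code points."""
    low, high = body.split("-")
    lo, hi = int(low[2:], 16), int(high[2:], 16)
    return {chr(cp) for cp in range(lo, hi + 1)}


body = "#x12-#x14"
literals = read_as_literals(body)   # '#', 'x', '1', '2', '-', '4'
code_points = read_as_range(body)   # U+0012, U+0013, U+0014

# The two readings of the same text share no characters at all.
print(sorted(literals), [hex(ord(c)) for c in sorted(code_points)])
print(literals.isdisjoint(code_points))

# Same problem for [^abc]: is '^' an operator or an excluded character?
excluded_if_operator = set("abc")
excluded_if_literal = set("^abc")
print("^" in excluded_if_operator, "^" in excluded_if_literal)
```

Both readings are mechanically reasonable, which is why the choice between
them has to be stated rather than left to the reader.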

Personally, I would MUCH rather not see a hairy definition for [].
I would rather see a simple syntax definition of [], EVEN IF 
it led to more complex notations like:

 [a-z] | [A-Z] | [0-9] | '_' | '.' | ':' 

and even if instead of [^abc] you saw:

 Char - ('a' | 'b' | 'c')

Another thing I like about  "Char - ('a' | 'b' | 'c')" is that it makes
clear what the set is that abc are being removed from.  When you don't
specify, it might mean Char or it might be some other set.
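
A small sketch of why the explicit base set matters, modelling
"Base - (x | y | z)" as set difference. The two base sets below are
stand-ins chosen for illustration (Python's printable ASCII and ASCII
letters), not the spec's actual Char production.

```python
import string


def exclude(base, removed):
    """Model 'Base - (x | y | ...)' as plain set difference."""
    return set(base) - set(removed)


# Stand-ins for two possible base sets (assumptions, for illustration only).
CHAR = set(string.printable)
LETTERS = set(string.ascii_letters)

from_char = exclude(CHAR, "abc")
from_letters = exclude(LETTERS, "abc")

# '1' survives only when the base set contained it to begin with.
print("1" in from_char, "1" in from_letters)
```

With the base written out, the answer for a character like '1' is no longer
a matter of guessing what the complement was taken against.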

Among other things, using a more cumbersome notation would encourage 
you to name these odd little collections of characters.  Why on earth
are "_", ".", and ":" allowed in one case but another arbitrary-looking
set in another context??  If you named these better, and used descriptions
like:

 lc-alpha | uc-alpha | digit | nameprefix

in place of 

 [a-zA-Z0-9_.:]

it would make a lot more sense and would have a normative effect on the
terminology used by parser-writers to describe these odd little sets.
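
As a rough illustration of the named-set style, in Python. The names follow
the suggestion above; the contents of nameprefix ('_', '.', ':') are an
inference from the bracket expression it would replace, not something any
spec defines.

```python
import string

# Named building blocks; the contents of NAMEPREFIX are inferred.
LC_ALPHA = frozenset(string.ascii_lowercase)
UC_ALPHA = frozenset(string.ascii_uppercase)
DIGIT = frozenset(string.digits)
NAMEPREFIX = frozenset("_.:")

# lc-alpha | uc-alpha | digit | nameprefix
NAME_CHARS = LC_ALPHA | UC_ALPHA | DIGIT | NAMEPREFIX


def is_name_char(ch):
    """True if ch belongs to the union of the four named sets."""
    return ch in NAME_CHARS


print(is_name_char("q"), is_name_char(":"), is_name_char("-"))
```

A parser written this way carries the same vocabulary as the grammar, which
is the normative effect on terminology argued for above.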
 -kmp

-----------
DISCLAIMER:
 The above are my personal feelings and not necessarily 
 Harlequin's official position.

Received on Friday, 17 April 1998 04:26:39 UTC