Regex comments from James Clark on 2000-12-05 (www-xml-schema-comments@w3.org from October to December 2000)

From: James Clark <jjc@jclark.com>
Date: Tue, 05 Dec 2000 15:23:36 +0700
To: www-xml-schema-comments@w3.org
Message-ID: <3A2CA608.AEE52AF1@jclark.com>

Some comments on Appendix F of Schema Part 2.

1. The section seems to be crying out for a formal grammar.

2. The definition of character class escapes should mention "block
escapes". (It also should say that the "valid character class escapes
*are* ..." not "include ...".)

3. The terminology in the description of category escapes is broken. 
"Lu", "Ll" etc are not character properties but are possible values of
the "General Category" property.  It is not satisfactory to say "the
following table specifies the main character properties".  There needs
to be a precise statement of exactly what is allowed as a category
escape.  It seems like what you mean is any two-letter sequence that
occurs as the value of the General Category property of some character,
or the first letter of such a two-letter sequence. It would be helpful
to refer to Section 4.5 of Unicode3.
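
As a rough illustration of that reading, the set of permitted category
escape names could be computed from the character database itself. The
sketch below uses Python's unicodedata module, which tracks a later
Unicode version than Unicode 3, so the exact result is an assumption; the
helper name category_escape_names is made up for this example. A spec
might also want to exclude values such as "Cs" that this naive rule
admits.

```python
import sys
import unicodedata


def category_escape_names():
    """Collect every two-letter General Category value that some code point
    actually has, plus the first letter of each such value."""
    two_letter = {
        unicodedata.category(chr(cp)) for cp in range(sys.maxunicode + 1)
    }
    one_letter = {value[0] for value in two_letter}
    return two_letter | one_letter


names = category_escape_names()
print(sorted(names))
```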

4. It seems strange to have an escape for name characters but not for
name start characters (the characters allowed at the beginning of a
name).  This means I cannot conveniently write a regex that matches XML
names. (Or cannot I do it with \c

5. It would be helpful to say exactly where the definitive list of block
names is to be found: in the Blocks.txt file of the Unicode Character
Database (http://www.unicode.org/Public/UNIDATA/Blocks.txt). The Unicode
standard itself doesn't quite do it: for example, the chart for 0000-007F
is entitled "C0 Controls and Basic Latin", whereas Blocks.txt calls it
simply "Basic Latin".
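
To show how block names might be taken from that file, here is a minimal
sketch that parses lines in the "start..end; name" layout used by current
versions of Blocks.txt (the layout of the 2000-era file may differ, so
treat the format as an assumption). The two-line SAMPLE is an excerpt
written out by hand for the example, and parse_blocks and block_of are
hypothetical helper names.

```python
def parse_blocks(text):
    """Parse 'START..END; Block Name' lines, skipping comments and blanks."""
    blocks = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        code_range, name = (field.strip() for field in line.split(";", 1))
        start, end = (int(h, 16) for h in code_range.split(".."))
        blocks.append((start, end, name))
    return blocks


def block_of(ch, blocks):
    """Return the name of the block containing ch, or None if not listed."""
    cp = ord(ch)
    for start, end, name in blocks:
        if start <= cp <= end:
            return name
    return None


# Hand-written excerpt, assumed to follow the current Blocks.txt layout.
SAMPLE = """\
# Start..End; Block Name
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
"""

blocks = parse_blocks(SAMPLE)
```

With this sample, block_of("A", blocks) gives the name "Basic Latin"
rather than the chart title.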

6. If I turn the prose description of character class subtraction into a
grammar I get:

character class ::= character class escape | character class expression
character class expression ::= '[' character group ']'
character group ::= positive character group
                  | negative character group
                  | character class subtraction
negative character group ::= '^' positive character group
character class subtraction ::= (positive character group | negative character group)
                                '-' character class expression

which suggests that a character class subtraction looks like:

 [abc-[def]]

If this is right, it's deeply confusing that the description of \w uses
an incompatible syntax: [...]-[...].  It is also a pretty bizarre
feature: is this really necessary? I couldn't find any mention of it in
the Regexp documentation I consulted. Overloading '-' for two completely
different operations doesn't seem like a good design.
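
For readers unfamiliar with the feature, the intended meaning of
[base-[removed]] appears to be "any character in base that is not in
removed". Python's re module has no subtraction syntax, so the sketch
below emulates that meaning with a negative lookahead; the helper name
subtract is hypothetical, and its arguments are assumed to be plain
class bodies such as "a-z" that need no escaping.

```python
import re


def subtract(base_class, removed_class):
    """Build a pattern matching one character that is in [base_class]
    but not in [removed_class], emulating [base-[removed]]."""
    return f"(?![{removed_class}])[{base_class}]"


# Rough equivalent of [a-z-[aeiou]]: lowercase ASCII consonants.
consonant = re.compile(subtract("a-z", "aeiou"))

# Rough equivalent of [abc-[def]]: the sets are disjoint, so nothing
# is actually removed.
abc_only = re.compile(subtract("abc", "def"))
```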

James

Received on Tuesday, 5 December 2000 03:25:26 UTC