
Regex [+-] syntax

From: Michael Kay <mike@saxonica.com>
Date: Mon, 15 Aug 2005 21:35:55 +0100
To: <www-xml-schema-comments@w3.org>
Message-ID: <E1E4lgh-0006zW-Sm@maggie.w3.org>


(previously raised, without getting a response, on xmlschema-dev)

The text defining regular expressions in Appendix F of XML Schema Part 2:
Datatypes Second Edition (28 Oct 2004) seems to be inconsistent between the
BNF and the accompanying prose.

In particular, what characters are allowed to appear as s and e in a range
[s-e]?

The production rules say

[18]   	seRange	   ::=   	charOrEsc '-' charOrEsc
[20]   	charOrEsc	   ::=   	XmlChar | SingleCharEsc
[21]   	XmlChar	   ::=   	[^\#x2D#x5B#x5D]

which imply that [, ], \, and - are disallowed in both positions.
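
To make that reading concrete, here is a minimal sketch in Python of what
production [21] excludes. It covers only the XmlChar branch of charOrEsc
(SingleCharEsc is not quoted above and is not modelled), and the function
names are my own, not anything taken from the spec.

```python
# Characters excluded by production [21]: backslash, #x2D, #x5B, #x5D.
XMLCHAR_EXCLUDED = frozenset({"\\", "\x2D", "\x5B", "\x5D"})


def is_xml_char(ch: str) -> bool:
    """Return True if the single character ch matches XmlChar (production [21])."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    return ch not in XMLCHAR_EXCLUDED


def grammar_allows_unescaped_range(s: str, e: str) -> bool:
    """Check an unescaped s-e pair against production [18].

    Assumption: both endpoints are plain characters, so only the XmlChar
    branch of charOrEsc applies. Code point ordering is a prose rule and
    is deliberately not checked here.
    """
    return is_xml_char(s) and is_xml_char(e)


if __name__ == "__main__":
    for ch in "\\-[]^a":
        print(repr(ch), is_xml_char(ch))
```

Under this sketch, none of the four characters can be an unescaped endpoint
on either side of the range, while ^ is accepted by the grammar in both
positions.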

But the text then elaborates this by saying that 

s-e is a valid character range iff:

    * s is a ·single character escape·, or an XML character;
    * s is not \
    * If s is the first character in a ·character class expression·, then s
is not ^
    * e is a ·single character escape·, or an XML character;
    * e is not \ or [; and
    * The code point of e is greater than or equal to the code point of s; 

Question: in this English text, what does "XML character" mean? Does it mean
any character allowed in XML, or does it mean XmlChar as defined in
production 21? I guess the latter, since many of the technical terms in this
section seem to be expanded versions of the names of production rules. But
many of these are written between middle dots, and this one isn't, so I
might be guessing wrong. Also, if it means XmlChar, then bullets 2 and 5 are
completely redundant.
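
The redundancy can be checked mechanically. The following is a rough sketch
in Python that applies the six bullets in order to a pair of unescaped
endpoints and reports the first bullet that rejects the pair. The function
name and the `narrow` flag are hypothetical: `narrow=True` takes "XML
character" to mean XmlChar, while `narrow=False` takes it to mean any
character (XML's own restrictions on legal characters are not modelled).

```python
from typing import Optional

# Characters excluded by XmlChar in production [21].
XMLCHAR_EXCLUDED = frozenset("\\-[]")


def first_failed_bullet(s: str, e: str, *, s_is_first: bool = False,
                        narrow: bool = False) -> Optional[int]:
    """Return the number (1-6) of the first prose bullet that rejects s-e,
    or None if all six bullets are satisfied.

    Assumption: s and e are single unescaped characters, so the
    "single character escape" alternative in bullets 1 and 4 never applies.
    """
    def is_xml_character(ch: str) -> bool:
        # Narrow reading: XmlChar. Broad reading: any character at all.
        return not (narrow and ch in XMLCHAR_EXCLUDED)

    if not is_xml_character(s):
        return 1
    if s == "\\":
        return 2
    if s_is_first and s == "^":
        return 3
    if not is_xml_character(e):
        return 4
    if e in ("\\", "["):
        return 5
    if ord(e) < ord(s):
        return 6
    return None


if __name__ == "__main__":
    printable = [chr(c) for c in range(0x20, 0x7F)]
    fired = {first_failed_bullet(s, e, narrow=True)
             for s in printable for e in printable}
    print("bullets that can fire under the narrow reading:", fired)
```

Under the narrow reading, bullets 2 and 5 never get a chance to fire, because
bullets 1 and 4 have already rejected the same characters. Under the broad
reading they do fire, but then "[" is accepted as s, and "]" and "-" are
accepted in either position, which disagrees with the grammar.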

The grammar rules say that \ and [ are disallowed in both positions, but the
English rules say \ is disallowed for the start of the range while both \
and [ are disallowed for the end. Why the inconsistency? Why is "-" not
mentioned in the prose?

Furthermore, the text below production rule 22 says:

# The [, ], - and \ characters are not valid character ranges;
# The ^ character is only valid at the beginning of a ·positive character
group· if it is part of a ·negative character group·
# The - character is a valid character range only at the beginning or end of
a ·positive character group·.

Bullets 1 and 3 seem to say different things about "-": presumably bullet 3
is intended to take precedence. One also feels that bullet 3 could be
expressed more helpfully: for example "The - character is a valid character
range only if it is the first character in a positive character group, or if
it is followed by "]" or "-", in which case it is taken as the last
character in a positive character group."
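
As an illustration of the wording suggested above, here is a minimal sketch
in Python. It is only my reading of the proposed rule, not something the
spec states; the function name and its parameters are hypothetical, and the
caller is assumed to know already where the positive character group starts.

```python
def hyphen_is_literal(expr: str, group_start: int, i: int) -> bool:
    """Decide whether the '-' at expr[i] stands for itself.

    expr        -- the regular expression text
    group_start -- index of the first character of the positive character
                   group (assumed to be known by the caller)
    i           -- index of the '-' being examined
    """
    if expr[i] != "-":
        raise ValueError("expr[i] must be '-'")
    # Case 1: the hyphen is the first character in the group.
    if i == group_start:
        return True
    # Case 2: the hyphen is followed by ']' or '-', so it is taken as the
    # last character in the group.
    following = expr[i + 1] if i + 1 < len(expr) else ""
    return following in ("]", "-")


if __name__ == "__main__":
    for expr, i in [("[-a]", 1), ("[a-]", 2), ("[a-z]", 2), ("[a--[b]]", 2)]:
        print(expr, i, hyphen_is_literal(expr, 1, i))
```

In the last example the first hyphen is followed by "-" and is therefore
literal; the second hyphen is followed by "[" and is not, which leaves it
free to act as the subtraction operator.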


Michael Kay
http://www.saxonica.com/
Received on Monday, 15 August 2005 20:36:07 GMT
