Regex syntax [+-]

From: Michael Kay <mike@saxonica.com>
Date: Thu, 4 Aug 2005 22:50:53 +0100
To: <xmlschema-dev@w3.org>
Message-ID: <E1E0nch-0002Ip-0s@maggie.w3.org>

I'm busy trying to implement the anti-erratum that says [+-] in a regex is
now legal, and I'm therefore trying to understand exactly what the rules
now are.

In particular, what characters are allowed to appear as s and e in a range
s-e?

The production rules say

[18]   	seRange	   ::=   	charOrEsc '-' charOrEsc
[20]   	charOrEsc	   ::=   	XmlChar | SingleCharEsc
[21]   	XmlChar	   ::=   	[^\#x2D#x5B#x5D]

which imply that [, ], \, and - are disallowed in both positions.
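
To make that reading concrete, here is a minimal Python sketch of what the
three productions above accept. It is an illustration, not a conforming
implementation. A token is modelled as either a one-character string or a
two-character escape. SingleCharEsc is not quoted in this message, so the
set of escapable characters below is an assumption taken from my reading of
production 23 in the spec and should be checked against it. All function
and constant names are hypothetical.

```python
# Characters excluded by production [21]: XmlChar ::= [^\#x2D#x5B#x5D]
XMLCHAR_EXCLUDED = {"\\", "-", "[", "]"}

# ASSUMPTION: characters that may follow "\" in a SingleCharEsc.
# Production 23 is not quoted in this message; verify against the spec.
SINGLE_CHAR_ESC_TAILS = set("nrt\\|.?*+(){}-[]^")


def is_xml_char(token: str) -> bool:
    """Production [21]: one character other than \\, -, [ or ]."""
    return len(token) == 1 and token not in XMLCHAR_EXCLUDED


def is_single_char_esc(token: str) -> bool:
    """Assumed production [23]: a backslash followed by one escapable character."""
    return (
        len(token) == 2
        and token[0] == "\\"
        and token[1] in SINGLE_CHAR_ESC_TAILS
    )


def is_char_or_esc(token: str) -> bool:
    """Production [20]: charOrEsc ::= XmlChar | SingleCharEsc."""
    return is_xml_char(token) or is_single_char_esc(token)


def grammar_allows_range(s: str, e: str) -> bool:
    """Production [18]: seRange ::= charOrEsc '-' charOrEsc (grammar only)."""
    return is_char_or_esc(s) and is_char_or_esc(e)
```

Under this reading, grammar_allows_range("a", "z") holds, while an
unescaped [, ], \ or - is rejected in either position. The sketch checks the
grammar only; it does not compare code points.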

But the text then elaborates this by saying that 

s-e is a valid character range iff:

    * s is a ·single character escape·, or an XML character;
    * s is not \
    * If s is the first character in a ·character class expression·, then s
is not ^
    * e is a ·single character escape·, or an XML character;
    * e is not \ or [; and
    * The code point of e is greater than or equal to the code point of s; 
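
The six bullets can be sketched as a Python predicate with the meaning of
"XML character" passed in as a parameter, so that the two candidate readings
can be compared side by side. This is a rough sketch under stated
assumptions: all names are hypothetical, the escape set is assumed from
production 23 (not quoted in this message), "any character allowed in XML"
is simplified to "any single character", and escapes other than \n, \r and
\t are assumed to stand for the character after the backslash.

```python
# ASSUMPTION: escapable characters, taken from production 23 of the spec.
SINGLE_CHAR_ESC_TAILS = set("nrt\\|.?*+(){}-[]^")
# ASSUMPTION: only these escapes map to a different character.
ESCAPE_VALUES = {"n": "\n", "r": "\r", "t": "\t"}


def is_single_char_esc(token: str) -> bool:
    return (
        len(token) == 2
        and token[0] == "\\"
        and token[1] in SINGLE_CHAR_ESC_TAILS
    )


def code_point(token: str) -> int:
    """Code point denoted by a one-character token or a two-character escape."""
    if is_single_char_esc(token):
        tail = token[1]
        return ord(ESCAPE_VALUES.get(tail, tail))
    return ord(token)


def any_xml_char(token: str) -> bool:
    """Reading A (simplified): any single character at all."""
    return len(token) == 1


def production_21_char(token: str) -> bool:
    """Reading B: XmlChar as defined by production [21]."""
    return len(token) == 1 and token not in {"\\", "-", "[", "]"}


def prose_allows_range(s, e, xml_char, s_is_first=False) -> bool:
    """Apply the six bullets in order, using the chosen reading of 'XML character'."""
    if not (is_single_char_esc(s) or xml_char(s)):
        return False  # bullet 1
    if s == "\\":
        return False  # bullet 2
    if s_is_first and s == "^":
        return False  # bullet 3
    if not (is_single_char_esc(e) or xml_char(e)):
        return False  # bullet 4
    if e in ("\\", "["):
        return False  # bullet 5
    return code_point(e) >= code_point(s)  # bullet 6
```

With reading A, prose_allows_range("[", "a", any_xml_char) and
prose_allows_range("+", "-", any_xml_char) both hold. With reading B both
are rejected by bullets 1 and 4, and the checks for bullets 2 and 5 can
never fire, because production 21 has already excluded \ and [.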

Question: in this English text, what does "XML character" mean? Does it mean
any character allowed in XML, or does it mean XmlChar as defined in
production 21? (If it means XmlChar, why are bullets 2 and 5 there?)

The grammar rules say that \ and [ are disallowed in both positions, but the
English rules say \ is disallowed for the start of the range while both \
and [ are disallowed for the end. Why the inconsistency? Why is "-" not
mentioned at all?

I'm left more confused than ever!

Michael Kay
Received on Thursday, 4 August 2005 21:51:34 UTC
