[Bug 1889] Regex [+-] syntax

http://www.w3.org/Bugs/Public/show_bug.cgi?id=1889

------- Comment #5 from mike@saxonica.com  2006-11-17 00:49 -------
Since there doesn't seem to be much effort going into resolving this, and since
it accounts for a significant proportion of the problems I am having in
matching the published test suite results, let me propose a solution.

PROPOSAL

(a) leave the grammar unchanged

(b) in each of the definitions in App. F, where the term being defined is spelt
differently from the corresponding metasymbol, add a cross-reference. For
example: "Definition: A regular expression (regExp) is composed from zero or
more ·branch·es, separated by | characters." This is to remove any ambiguity
about whether the term "XML Character" is a reference to the metasymbol XMLChar
or to some other concept with a similar name...

(c) expand the definition of Character Range:

[Definition:] A character range (charRange) R identifies a set of characters
C(R) containing all XML characters with UCS code points in a specified range. 

(d) replace the text below rule 22 as follows:

There are two forms of character range: a ·start-end range·, and a
·single-character range·. A character or ·single character escape· is taken as
the start of a ·start-end range· if (a) it is valid as such, and (b) it is
immediately followed by a hyphen. Otherwise (if it is valid as such) it is
taken as a ·single-character range·.

[Definition:] A ·start-end range· (seRange) s-e identifies the set that
contains all XML characters with UCS code points greater than or equal to the
code point of s, but not greater than the code point of e.

For s-e to be a valid character range, it must satisfy the following rules in
addition to those implied by the grammar:

    * If s is the first character in a ·character class expression·, then s is
      not ^.
    * The code point of e is greater than or equal to the code point of s.

Note:  The code point of a ·single character escape· is the code point of the
single character in the set of characters that it identifies. 
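
As an illustration only (not part of the proposed text), the set identified by
a ·start-end range· and its two validity rules might be sketched in Python as
follows. The names is_xml_char, start_end_range and s_is_first are hypothetical,
and the test for an XML character assumes the Char production of XML 1.0.

```python
def is_xml_char(cp):
    """True if code point cp is matched by the XML 1.0 Char production (assumed)."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def start_end_range(s, e, s_is_first=False):
    """Return the code points identified by the start-end range s-e.

    s and e are single characters (an escape is assumed to be already
    resolved to the one character it identifies). s_is_first says whether
    s is the first character of the character class expression.
    """
    if s_is_first and s == "^":
        raise ValueError("s must not be ^ at the start of the expression")
    if ord(e) < ord(s):
        raise ValueError("code point of e is less than code point of s")
    # All XML characters from s up to and including e.
    return {cp for cp in range(ord(s), ord(e) + 1) if is_xml_char(cp)}
```

For example, start_end_range("a", "c") gives the code points of a, b and c,
while start_end_range("z", "a") is rejected because e precedes s.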

[Definition:] A ·single XML character· (XMLChar) is a ·character range· that
identifies the set of characters containing only itself. For a character to be
a valid ·character range·, it must satisfy the following rules in addition to
those implied by the grammar:

    * The ^ character is valid at the beginning of a ·positive character
      group· only if it is part of a ·negative character group·
    * The - character is a valid ·character range· only
      (a) at the beginning of a ·positive character group·, or
      (b) if immediately followed by a ']' character

Note: An unescaped - character is handled as follows. If it appears at the
start of a ·positive character group· or immediately before a ']' character
then it is taken as representing a literal hyphen. If it appears immediately
before a '[' character it is taken as representing a subtraction operator
(regardless of whether what follows is a valid ·character class expression·). If
it appears immediately after a character or character escape that is valid as
the start of a ·start-end range·, then it causes that character or character
escape to be treated as the start of a ·start-end range·. If it appears
anywhere else (for example, after another hyphen, or after the end of a
·start-end range· but not followed by '['), then it is an error.
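
The rules in this note can be read as a small decision procedure, applied in
the order given. The Python sketch below is one possible reading, not a
definitive implementation: classify_hyphens and the role names are
hypothetical, the input is assumed to be a complete bracketed expression,
every backslash is assumed to begin a two-character escape (so category
escapes such as \p{L} are not handled), and only escapes in the assumed
SingleCharEsc set are treated as able to start a range.

```python
# Second characters of a single character escape (assumed from the
# SingleCharEsc production); other escapes cannot start a range here.
SINGLE_CHAR_ESCAPES = set("nrt\\|.?*+(){}-[]^")


def classify_hyphens(expr):
    """Return (index, role) for every unescaped '-' in a bracketed expression."""
    if len(expr) < 2 or expr[0] != "[" or expr[-1] != "]":
        raise ValueError("expected a bracketed character class expression")
    end = len(expr) - 1                        # index of the closing ']'
    start = 2 if expr.startswith("[^") else 1  # start of the positive group
    roles = []
    prev = "none"  # "none", "atom" (may start a range), "range_op", "closed"
    i = start
    while i < end:
        ch = expr[i]
        if ch == "-":
            nxt = expr[i + 1]
            if i == start or nxt == "]":
                roles.append((i, "literal"))
                prev = "closed"
            elif nxt == "[":
                roles.append((i, "subtraction"))
                break                          # nested group is not scanned
            elif prev == "atom":
                roles.append((i, "range"))
                prev = "range_op"
            else:
                roles.append((i, "error"))
                prev = "closed"
            i += 1
            continue
        if ch == "\\":
            can_start = expr[i + 1] in SINGLE_CHAR_ESCAPES
            i += 2
        else:
            can_start = True
            i += 1
        # An atom directly after a range operator ends that range.
        prev = "atom" if can_start and prev != "range_op" else "closed"
    return roles
```

Under these assumptions, classify_hyphens("[+-]") reports the hyphen as a
literal, classify_hyphens("[a-z-[aeiou]]") reports a range operator followed by
a subtraction operator, and the second hyphen of "[0-9-A-Z]" is reported as an
error.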

NOTE ON PROPOSAL

Some regex implementations are more permissive than this. For example, they
allow - as the start or end of a start-end range, and they allow constructs
such as [0-9-A-Z] meaning zero-to-nine, hyphen, or A-Z. 
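
As a concrete illustration of such permissiveness, Python's re module (which is
not an XML Schema regex engine and is used here only as an example) accepts
[0-9-A-Z] with the middle hyphen read as a literal. The helper name accepts is
hypothetical.

```python
import re

# Compiles without error in Python's re; under the proposal above the
# second hyphen would instead be an error.
pattern = re.compile(r"[0-9-A-Z]")


def accepts(ch):
    """True if the single character ch is in the class [0-9-A-Z]."""
    return pattern.fullmatch(ch) is not None
```

With this pattern, accepts("5"), accepts("-") and accepts("Q") all hold, while
accepts("a") does not.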

Received on Friday, 17 November 2006 02:27:52 UTC