RE: Two questions on Regex's

From: Biron,Paul V <Paul.V.Biron@kp.org>
Date: Tue, 16 Jul 2002 15:43:34 -0700
Message-Id: <8904C60CACA7D51191BC00805FEAAF43D108D2@crdc-exch-7.crdc.kp.org>
To: "'Ashok Malhotra'" <ashokma@microsoft.com>, W3C XML Schema Comments list <www-xml-schema-comments@w3.org>

> -----Original Message-----
> From:	Ashok Malhotra [SMTP:ashokma@microsoft.com]
> Sent:	Tuesday, July 16, 2002 12:51 PM
> To:	W3C XML Schema Comments list
> Cc:	Biron, Paul V
> Subject:	Two questions on Regex's 
> In the regular expression specification in XML Schema part 2,
> the first set of rules declares "The ^ character is only valid at the beginning of a ·positive character group· <http://www.w3.org/TR/xmlschema-2/> if it is part of a ·negative character group· <http://www.w3.org/TR/xmlschema-2/>", where
>     negCharGroup ::= '^' posCharGroup
> does this in fact mean that the following are true:
> 1)      The ^ character may appear anywhere except the first position of a <pcg>, if the <pcg> is not part of a <ncg>. (It is obvious that the ^ cannot be in the first position of a <pcg>, or it would match the definition of a <ncg>.) E.g. [a^b] is a legal <pcg>, but [^ab] is of course a <ncg>.
>       If the <pcg> is part of a <ncg>, then any position of the <pcg> may contain a ^. E.g. [^^abc] and [^ab^c] are both legal.
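
As a rough illustration of the four character groups in question 1, here is a minimal sketch using Python's `re` module. This is an assumption made for illustration: Python's dialect is not the XML Schema regex language, but it treats a leading ^ in a bracket expression the same way. The helper name `matches` is hypothetical, and `re.fullmatch` is used because XML Schema patterns match the whole string.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None

# [a^b]: the ^ is not in first position, so it is an ordinary member.
print([matches(r"[a^b]", ch) for ch in "a^b"])

# [^ab]: the leading ^ negates the group; a literal ^ is therefore accepted.
print(matches(r"[^ab]", "a"), matches(r"[^ab]", "^"))

# [^^abc] and [^ab^c]: after the negating ^, any further ^ is a member
# of the positive group being negated, so a literal ^ is rejected.
print(matches(r"[^^abc]", "^"), matches(r"[^^abc]", "d"))
print(matches(r"[^ab^c]", "^"), matches(r"[^ab^c]", "d"))
```
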

Additionally, an erratum that modifies these rules slightly has been approved by the WG and is ready for imminent publication. It is:

modify production [17] so that it becomes:

	[17] charRange ::=  seRange | XmlCharRef 

(that is, delete XmlCharIncDash option and production [22] entirely)

and delete the 3rd bullet of the paragraph below that (i.e., the one that begins "The - character is valid...").

> 2)      The second set of rules state that an 's-e character range' describes a range of XML characters. Is this second set of rules building upon the first set of rules above them? We think not, looking at the BNF production above both sets of rules: XmlCharIncDash is defined for a single character, whereas XmlChar/SingleCharEsc are defined for a character range. What is confusing is if the second set of rules is not built on top of the first set of rules, then the rules 's is not \' and 'e is not \ or [' are redundant because the production of XmlChar already disallows the characters '\', '[', and ']'. If these rules are not redundant, then that means the following is implied to be legal:
I'm not sure exactly what the question here is.  Does your thinking about the 2nd set of rules building upon the 1st (I'm not even really sure what that means) change given the above ruling that XmlCharIncDash is now gone?

> i.      [ab]-c] matches 'a', 'b', and any character between ']' (U+5D) and 'c' (U+63). It seems strange that this would be legal, because it would be complicated for a regex parser to know that if a '-' follows a ']' in a <pcg>, then this ']' *may* not signify the end of the <pcg>. But if the regex is [ab]-c, then that ']' would signify the end of the <pcg>, and it would match 'a' or 'b' followed by '-c'.
You are correct that [ab]-c] does not match "'ab' followed by any character between ']' and 'c'"...it matches "('a' or 'b') followed by '-c]'".  And you are correct that [ab]-c matches "('a' or 'b') followed by -c".
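
The two readings above can be tried out with a minimal sketch in Python's `re` module. This is only an approximation chosen for illustration: Python's regex dialect differs from the XML Schema one, and a given schema processor may be stricter about an unescaped ']' outside a character group. The helper name `whole_match` is hypothetical.

```python
import re

def whole_match(pattern: str, text: str) -> bool:
    """Return True if the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None

# The first ']' closes the character group, so the remaining "-c]"
# is read as three literal characters.
print(whole_match(r"[ab]-c]", "a-c]"))
print(whole_match(r"[ab]-c]", "b-c]"))

# It is not a range from ']' (U+5D) to 'c' (U+63): single characters
# from that range do not match.
print(whole_match(r"[ab]-c]", "]"))
print(whole_match(r"[ab]-c]", "_"))

# Without the trailing ']', the pattern is ('a' or 'b') followed by "-c".
print(whole_match(r"[ab]-c", "a-c"))
print(whole_match(r"[ab]-c", "c"))
```
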

Is there a specific change to the BNF or prose that you would like to see that would make these interpretations more clear?

Received on Tuesday, 16 July 2002 18:54:58 UTC
