- From: Ashok Malhotra <ashokma@microsoft.com>
- Date: Tue, 16 Jul 2002 12:51:01 -0700
- To: "W3C XML Schema Comments list" <www-xml-schema-comments@w3.org>
- Cc: <Paul.V.Biron@kp.org>
- Message-ID: <E5B814702B65CB4DA51644580E4853FB014888EE@red-msg-12.redmond.corp.microsoft.com>
In the regular expression specification in XML Schema Part 2, the first set of rules declares: "The ^ character is only valid at the beginning of a *positive character group* <http://www.w3.org/TR/xmlschema-2/> if it is part of a *negative character group* <http://www.w3.org/TR/xmlschema-2/>", where

negCharGroup ::= '^' posCharGroup

Does this in fact mean that the following are true?

1) The ^ character may appear anywhere except the first position of a <pcg>, if the <pcg> is not part of a <ncg>. (It is obvious that the ^ cannot be in the first position of a <pcg>, or it would match the definition of a <ncg>.) E.g. [a^b] is a legal <pcg>, but [^ab] is of course a <ncg>.

If the <pcg> is part of a <ncg>, then any position of the <pcg> may contain a ^. E.g. [^^abc] and [^ab^c] are both legal.

2) The second set of rules states that an 's-e character range' describes a range of XML characters. Is this second set of rules building upon the first set of rules above it? We think not, looking at the BNF production above both sets of rules: XmlCharIncDash is defined for a single character, whereas XmlChar/SingleCharEsc are defined for a character range.

What is confusing is that, if the second set of rules is not built on top of the first set, then the rules 's is not \' and 'e is not \ or [' are redundant, because the production of XmlChar already disallows the characters '\', '[', and ']'.

If these rules are not redundant, then the following is implied to be legal:

i. [ab]-c] matches 'a', 'b', and any character between ']' (U+5D) and 'c' (U+63).

It seems strange that this would be legal, because it would be complicated for a regex parser to know that if a '-' follows a ']' in a <pcg>, then this ']' *may* not signify the end of the <pcg>. But if the regex is [ab]-c, then that ']' would signify the end of the <pcg>, and it would match 'a' or 'b' followed by '-c'.

Many thanks!
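For comparison only, here is a minimal sketch of how one widely used engine, Python's `re` module, resolves the same expressions. This is an assumption-laden illustration: Python's syntax is not the XML Schema regular expression grammar, so the results say nothing about what the specification intends. The helper name `matches` is hypothetical.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Return True if the whole of `text` matches `pattern` (Python re, not XSD)."""
    return re.fullmatch(pattern, text) is not None


# A '^' that is not in the first position is an ordinary member of the group.
caret_inside = matches(r"[a^b]", "^")

# A leading '^' negates the group.
negated_rejects_a = matches(r"[^ab]", "a")
negated_accepts_c = matches(r"[^ab]", "c")

# Inside a negated group, further '^' characters are literal members,
# so '^' itself is excluded by both of these groups.
double_caret = matches(r"[^^abc]", "^")
middle_caret = matches(r"[^ab^c]", "^")

# Python closes the group at the first ']'. The remainder "-c]" is then
# literal text, so no ']'-to-'c' range is formed.
early_close = matches(r"[ab]-c]", "a-c]")
range_reading = matches(r"[ab]-c]", "_")  # '_' is U+5F, between U+5D and U+63

print(caret_inside, negated_rejects_a, negated_accepts_c)
print(double_caret, middle_caret, early_close, range_reading)
```

Under Python's rules the first ']' always ends the group, which is the simpler of the two readings of [ab]-c] discussed in item i.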
All the best, Ashok
===========================================================
Ashok Malhotra <mailto: ashokma@microsoft.com>
Microsoft Corporation
212 Hessian Hills Road
Croton-On-Hudson, NY 10520 USA
Redmond: 425-703-9462
New York: 914-271-6477
Received on Tuesday, 16 July 2002 15:51:34 UTC