
Two questions on Regex's

From: Ashok Malhotra <ashokma@microsoft.com>
Date: Tue, 16 Jul 2002 12:51:01 -0700
Message-ID: <E5B814702B65CB4DA51644580E4853FB014888EE@red-msg-12.redmond.corp.microsoft.com>
To: "W3C XML Schema Comments list" <www-xml-schema-comments@w3.org>
Cc: <Paul.V.Biron@kp.org>
In the regular expression specification in XML Schema Part 2
<http://www.w3.org/TR/xmlschema-2/>, the first set of rules declares
"The ^ character is only valid at the beginning of a *positive character
group* if it is part of a *negative character group*", where
    negCharGroup ::= '^' posCharGroup

Does this in fact mean that the following are true?
1)	The ^ character may appear anywhere except the first position of
a <pcg>, if the <pcg> is not part of a <ncg>. (Obviously the ^ cannot
be in the first position of a <pcg>, or it would match the definition
of a <ncg>.) E.g. [a^b] is a legal <pcg>, but [^ab] is of course a
<ncg>.
      If the <pcg> is part of a <ncg>, then any position of the <pcg>
may contain a ^. E.g. [^^abc] and [^ab^c] are both legal.
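
To make the reading in question 1 concrete, here is a minimal Python
sketch. The helper parse_simple_group is hypothetical and is not taken
from the specification or from any library. It assumes a group that
contains only literal characters (no escapes, ranges, or subtraction),
treats a leading ^ as the negCharGroup marker, and treats every other ^
as a literal.

```python
def parse_simple_group(expr: str) -> tuple[bool, set[str]]:
    """Split a bracketed group into (negated, member characters).

    Hypothetical helper: handles only literal characters, with no
    escapes, ranges, or character class subtraction.
    """
    if len(expr) < 3 or expr[0] != "[" or expr[-1] != "]":
        raise ValueError(f"not a bracketed group: {expr!r}")
    body = expr[1:-1]
    # Under the proposed reading, a leading ^ belongs to negCharGroup.
    negated = body.startswith("^")
    members = body[1:] if negated else body
    if not members:
        raise ValueError(f"empty positive character group: {expr!r}")
    # Any ^ still present in `members` is treated as a literal character.
    return negated, set(members)


for example in ("[a^b]", "[^ab]", "[^^abc]", "[^ab^c]"):
    negated, members = parse_simple_group(example)
    kind = "ncg" if negated else "pcg"
    print(example, kind, sorted(members))
```

Under this reading [a^b] comes out as a <pcg> whose members include a
literal ^, while the other three come out as <ncg>s.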
2)	The second set of rules states that an 's-e character range'
describes a range of XML characters. Does this second set of rules
build upon the first set of rules above it? We think not, looking
at the BNF production above both sets of rules: XmlCharIncDash is
defined for a single character, whereas XmlChar/SingleCharEsc are
defined for a character range. What is confusing is that, if the second
set of rules is not built on top of the first set, then the rules 's
is not \' and 'e is not \ or [' are redundant, because the production
for XmlChar already disallows the characters '\', '[', and ']'. If these
rules are not redundant, then the following is implied to be
legal:
i.	[ab]-c] matches 'a', 'b', and any character between ']' (U+5D)
and 'c' (U+63). It seems strange that this would be legal, because it
would be complicated for a regex parser to know that, when a '-' follows
a ']' in a <pcg>, this ']' *may* not signify the end of the <pcg>.
But if the regex is [ab]-c, then that ']' would signify the end of the
<pcg>, and the regex would match 'a' or 'b' followed by '-c'.
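
The two competing readings of [ab]-c] can be sketched in Python as
follows. Both function names are hypothetical, and neither is a
conforming XML Schema regex parser. The first assumes the first ']'
always closes the group. The second assumes ']'-'c' is accepted as an
s-e range and simply enumerates the code points that range would cover.

```python
def split_at_first_bracket(regex: str) -> tuple[str, str]:
    """Reading A: the first ']' always closes the group."""
    close = regex.index("]")
    return regex[: close + 1], regex[close + 1 :]


def members_with_bracket_range(singles: str, start: str, end: str) -> set[str]:
    """Reading B: `start`-`end` is an s-e range inside the group."""
    if ord(start) > ord(end):
        raise ValueError("range start must not exceed range end")
    in_range = {chr(cp) for cp in range(ord(start), ord(end) + 1)}
    return set(singles) | in_range


# Reading A: the group is [ab], and the rest is ordinary regex text.
print(split_at_first_bracket("[ab]-c]"))
print(split_at_first_bracket("[ab]-c"))

# Reading B: 'a', 'b', plus every code point from U+005D to U+0063.
reading_b = members_with_bracket_range("ab", "]", "c")
print(sorted(hex(ord(ch)) for ch in reading_b))
```

Under reading B the group covers seven code points, U+005D through
U+0063, which already include 'a' and 'b'. A parser that supported it
would need lookahead past each ']' to tell the two cases apart.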

Many thanks!

All the best, Ashok 
===========================================================
Ashok Malhotra              <mailto: ashokma@microsoft.com> 
Microsoft Corporation
212 Hessian Hills Road
Croton-On-Hudson, NY 10520 USA 
Redmond: 425-703-9462                New York: 914-271-6477 
Received on Tuesday, 16 July 2002 15:51:34 GMT
