- From: <bugzilla@wiggum.w3.org>
- Date: Fri, 17 Nov 2006 00:49:18 +0000
- To: www-xml-schema-comments@w3.org
- CC:
http://www.w3.org/Bugs/Public/show_bug.cgi?id=1889

------- Comment #5 from mike@saxonica.com 2006-11-17 00:49 -------

Since there doesn't seem to be much effort going into resolving this, and since it accounts for a significant proportion of the problems I am having in matching the published test suite results, let me propose a solution.

PROPOSAL

(a) leave the grammar unchanged

(b) in each of the definitions in App. F, where the term being defined is spelt differently from the corresponding metasymbol, add a cross-reference. For example:

"Definition: A regular expression (regExp) is composed from zero or more ·branch·es, separated by | characters."

This is to remove any ambiguity about whether the term "XML Character" is a reference to the metasymbol XMLChar or to some other concept with a similar name...

(c) expand the definition of Character Range:

[Definition:] A character range (charRange) R identifies a set of characters C(R) containing all XML characters with UCS code points in a specified range.

(d) replace the text below rule 22 as follows:

There are two forms of character range: a ·start-end range·, and a ·single-character range·. A character or ·single character escape· is taken as the start of a ·start-end range· if (a) it is valid as such, and (b) it is immediately followed by a hyphen. Otherwise (if it is valid as such) it is taken as a ·single-character range·.

[Definition:] A ·start-end range· (seRange) s-e identifies the set that contains all XML characters with UCS code points greater than or equal to the code point of s, but not greater than the code point of e.

For s-e to be a valid character range, it must satisfy the following rules in addition to those implied by the grammar:

* If s is the first character in a ·character class expression·, then s is not ^
* The code point of e is greater than or equal to the code point of s

Note: The code point of a ·single character escape· is the code point of the single character in the set of characters that it identifies.

[Definition:] A ·single XML character· (XMLChar) is a ·character range· that identifies the set of characters containing only itself.

For a character to be a valid ·character range·, it must satisfy the following rules in addition to those implied by the grammar:

* The ^ character is only valid at the beginning of a ·positive character group· if it is part of a ·negative character group·
* The - character is a valid ·character range· only (a) at the beginning of a ·positive character group·, or (b) if immediately followed by a ']' character

Note: An unescaped - character is handled as follows. If it appears at the start of a ·positive character group· or immediately before a ']' character then it is taken as representing a literal hyphen. If it appears immediately before a '[' character it is taken as representing a subtraction operator (regardless of whether what follows is a valid ·character class expression·). If it appears immediately after a character or character escape that is valid as the start of a ·start-end range·, then it causes that character or character escape to be treated as the start of a ·start-end range·. If it appears anywhere else (for example, after another hyphen, or after the end of a ·start-end range· but not followed by '['), then it is an error.

NOTE ON PROPOSAL

Some regex implementations are more permissive than this. For example, they allow - as the start or end of a start-end range, and they allow constructs such as [0-9-A-Z] meaning zero-to-nine, hyphen, or A-Z.
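
As an illustration only, and not as part of the proposed spec text, the handling of an unescaped - described in the Note under (d) could be sketched in Python roughly as follows. The function name classify_hyphens, the role labels, and the choice to apply the four cases in the order the Note lists them are all assumptions of this sketch. It only classifies hyphens at the top level of a character class expression; it does not check the code-point ordering of s and e, the rules for ^, or the contents of a subtracted expression.

```python
# Single character escapes per rule SingleCharEsc: \n \r \t \\ \| \. \? \* \+ \( \) \{ \} \- \[ \] \^
SINGLE_CHAR_ESCAPES = set("nrt\\|.?*+(){}-[]^")


def classify_hyphens(expr):
    """Return (index, role) for each unescaped '-' at the top level of expr.

    role is one of 'literal', 'subtraction', 'range', 'error'.
    expr is a whole character class expression such as '[a-z]'.
    Hypothetical helper: a sketch of the Note's rules, not a validator.
    """
    if len(expr) < 2 or expr[0] != "[" or expr[-1] != "]":
        raise ValueError("expected a bracketed character class expression")
    end = len(expr) - 1                      # index of the closing ']'
    i = 2 if expr.startswith("[^") else 1    # skip '^' of a negative group
    group_start = i
    # What came just before the current position:
    #   'atom'      char or single character escape usable as a range start
    #   'range_op'  a hyphen already taken as a range operator
    #   'range_end' the end point of a start-end range
    #   'other'     anything else (start of group, \d, \p{..}, literal '-')
    prev = "other"
    roles = []
    while i < end:
        c = expr[i]
        if c == "\\":
            if i + 1 >= end:
                raise ValueError("dangling backslash")
            nxt = expr[i + 1]
            if nxt in "pP" and expr[i + 2:i + 3] == "{":
                close = expr.find("}", i)
                if close == -1 or close >= end:
                    raise ValueError("unterminated category escape")
                width, kind = close - i + 1, "other"
            else:
                width = 2
                kind = "atom" if nxt in SINGLE_CHAR_ESCAPES else "other"
            prev = "range_end" if prev == "range_op" else kind
            i += width
        elif c == "-":
            if i == group_start or i + 1 == end:
                roles.append((i, "literal"))      # start of group or before ']'
                prev = "other"
            elif expr[i + 1] == "[":
                roles.append((i, "subtraction"))  # rest is the subtracted expression
                break
            elif prev == "atom":
                roles.append((i, "range"))        # previous atom starts a range
                prev = "range_op"
            else:
                roles.append((i, "error"))        # e.g. after the end of a range
                prev = "other"
            i += 1
        else:
            prev = "range_end" if prev == "range_op" else "atom"
            i += 1
    return roles


if __name__ == "__main__":
    for sample in ["[a-z]", "[-a]", "[a-z-[aeiou]]", "[0-9-A-Z]"]:
        print(sample, classify_hyphens(sample))
```

Under these assumptions the second hyphen in [0-9-A-Z] is reported as an error, which is the stricter behaviour the proposal intends, whereas the more permissive implementations mentioned above would accept it as a literal hyphen.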
Received on Friday, 17 November 2006 02:27:52 UTC