- From: <bugzilla@jessica.w3.org>
- Date: Sun, 23 Feb 2014 14:48:25 +0000
- To: www-xml-schema-comments@w3.org
https://www.w3.org/Bugs/Public/show_bug.cgi?id=24780

Bug ID: 24780
Summary: Request to clarify proper use of explicit UCS code point numbers in regular expression.
Product: XML Schema
Version: 1.0/1.1 both
Hardware: PC
OS: Windows NT
Status: NEW
Severity: minor
Priority: P2
Component: Datatypes: XSD Part 2
Assignee: David_E3@VERIFONE.com
Reporter: spectrum777@outlook.com
QA Contact: www-xml-schema-comments@w3.org
CC: cmsmcq@blackmesatech.com

The XSD 1.0 and 1.1 drafts do not currently clarify how one can specify characters within regular expression character class expressions by giving UCS code point numbers explicitly. The drafts do use an EBNF notation, #xNN, where NN is the hex value of the character, to indicate certain characters, but that does not settle the question. The following describes my experience, which, if I'm not blindly missing something, may be worthwhile feedback.

For example, consider the following from the 1.1 XSD draft (also, I believe, in the 1.0 draft):

NormalChar ::= [^.\?*+{}()|#x5B#x5D]

Obviously, this notation is not meant to be typed by the XSD author; it is a documentation convention used to convey something to the author. This is evident from the fact that #x5B within an actual character class expression would be read as the characters #, x, 5, and B (4 characters), not as the single character #x5B. Since the XSD specs don't clarify this, the implication seems to be that one falls back on the XML spec's definition of character references: a character reference would be the way to specify any additional characters within a regular expression. It might be nice to offer some form of clarification about this in the XSD drafts/specs.

More info...

To try to clarify this for myself, I searched within the XSD and XML drafts and found the following:

...
Note: The notation #xA used here (and elsewhere in this specification) represents the Universal Character Set (UCS) code point hexadecimal A (line feed), which is denoted by U+000A. This notation is to be distinguished from &#xA;, which is the XML character reference to that same UCS code point.
...

After much consideration, I believe the two are being distinguished solely because the #xNN format is purely EBNF for documentation purposes. This "Note" also seems to imply that character references are the only other form of escape one can use within a regular expression beyond those defined by the XSD regex docs.

In the same drafts, I see the following:

...
This specification makes use of the EBNF notation used in the [XML] specification. Note that some constructs of the EBNF notation used here resemble the regular-expression syntax defined in this specification (Regular Expressions (§G)), but that they are not identical: there are differences. For a fuller description of the EBNF notation, see Section 6. Notation of the [XML] specification.
...

This clarification is confusing for someone trying to understand XSD regular expression specifics, because the specification referred to above defines those regular expressions using EBNF. The regular expression syntax is called out using EBNF, yet EBNF itself is not exactly what the author of an XSD regular expression should use. Perhaps this is not generally bad, but it's confusing for someone like me who is trying to work out which UCS code point escaping options are available within an XSD regex. Once again, per the above, the heavy implication is that, if it isn't in the XSD draft, the XML spec is the fallback: whatever it allows is what one can use (and expect tools to support).

This issue arose because I was receiving an error from software which did not like an XML configuration file I had created. The file was well-formed, so I wanted to validate it against an XSD. An xmllint tool reported a failure. The failure was due to the use of \xNN hex character escapes within a regex pattern in the XSD. However, other XSD/XML validation software accepted the \xNN.
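The difference between the two spellings can be illustrated with a minimal sketch in Python. It assumes a hypothetical schema fragment (the type names `withCharRefs` and `withBackslashX` and the helper functions are invented for illustration) and uses only the standard library's XML parser, not a real schema validator. It shows that a character reference such as &#x41; is resolved by the XML parser before any regex processor reads the pattern value, whereas \x41 reaches the regex processor untouched, where \x is not among the escapes listed in the XSD regular expression grammar.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schema fragment: the first pattern uses XML character
# references, the second uses a Perl-style \xNN escape.
SCHEMA = r"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:simpleType name="withCharRefs">
    <xs:restriction base="xs:string">
      <xs:pattern value="[&#x41;-&#x5A;]+"/>
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="withBackslashX">
    <xs:restriction base="xs:string">
      <xs:pattern value="[\x41-\x5A]+"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>"""

# Characters that may follow a backslash in an XSD single-character
# escape (SingleCharEsc): n r t \ | . ? * + ( ) { } - [ ] ^
SINGLE_CHAR_ESC = set("nrt\\|.?*+(){}-[]^")
# Multi-character and category escapes: \s \S \i \I \c \C \d \D \w \W \p \P
OTHER_ESC = set("sSiIcCdDwWpP")


def pattern_values(schema_text):
    """Return each xs:pattern value as seen after XML parsing."""
    root = ET.fromstring(schema_text)
    return [p.get("value") for p in root.iter("{%s}pattern" % XS)]


def undefined_escapes(pattern):
    """Return backslash escapes in `pattern` that the XSD grammar does not list."""
    found = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt not in SINGLE_CHAR_ESC and nxt not in OTHER_ESC:
                found.append("\\" + nxt)
            i += 2  # skip the backslash and the character it escapes
        else:
            i += 1
    return found


patterns = pattern_values(SCHEMA)
for value in patterns:
    print(repr(value), undefined_escapes(value))
```

Because the reference is resolved first, a metacharacter written as a character reference (for example &#x5B;) still reaches the regex processor as a bare metacharacter and would still need a backslash escape.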
That led me to want to know who was "right". (I realize these are drafts and implementations can differ, but I was wondering whether I was missing some addendum/errata, or something right in the drafts themselves.) Some clarification here may be nice to have. If I missed something, please excuse me in advance. Thanks.

--
You are receiving this mail because:
You are the QA Contact for the bug.
Received on Sunday, 23 February 2014 14:48:30 UTC