[Bug 24780] New: Request to clarify proper use of explicit UCS code point numbers in regular expression. from bugzilla@jessica.w3.org on 2014-02-23 (www-xml-schema-comments@w3.org from January to March 2014)

From: <bugzilla@jessica.w3.org>
Date: Sun, 23 Feb 2014 14:48:25 +0000
To: www-xml-schema-comments@w3.org
Message-ID: <bug-24780-703@http.www.w3.org/Bugs/Public/>
https://www.w3.org/Bugs/Public/show_bug.cgi?id=24780

            Bug ID: 24780
           Summary: Request to clarify proper use of explicit UCS code
                    point numbers in regular expression.
           Product: XML Schema
           Version: 1.0/1.1 both
          Hardware: PC
                OS: Windows NT
            Status: NEW
          Severity: minor
          Priority: P2
         Component: Datatypes: XSD Part 2
          Assignee: David_E3@VERIFONE.com
          Reporter: spectrum777@outlook.com
        QA Contact: www-xml-schema-comments@w3.org
                CC: cmsmcq@blackmesatech.com

The XSD 1.0 and 1.1 drafts do not currently offer clarification on how one can
specify characters within regular expression character class expressions by
using UCS code point numbers explicitly. The drafts do currently use EBNF
notation to indicate certain characters by using #xNN where NN is the hex value
of the character, but that doesn't help clarify the former. The following
offers information on my experience, which, if I'm not blindly missing
something, may be worthwhile feedback.

For example, consider the following from the 1.1 XSD draft (also, I believe, in
the 1.0 draft):

    NormalChar ::= [^.\?*+{}()|#x5B#x5D]

Obviously, this notation is not meant for the XSD author to type; it is a
documentation convention used to convey something to the author. This is
evident in the fact that #x5B within an actual character class expression would
be read as the characters #, x, 5, and B (4 characters), not as the single
character #x5B.
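
To illustrate the point above, here is a minimal sketch. It uses Python's re
module as a stand-in, which is an assumption: Python's regex dialect is not the
XSD one, but for a plain character class with no escapes the two should behave
the same way.

```python
import re

# The EBNF token "#x5B" pasted verbatim into a character class.
literal_class = re.compile(r"[#x5B]")

# Each of the four characters matches on its own...
matches = [ch for ch in "#x5B" if literal_class.fullmatch(ch)]

# ...while the character the EBNF token stands for, "[", does not.
bracket_matches = literal_class.fullmatch("[") is not None

print(matches)          # ['#', 'x', '5', 'B']
print(bracket_matches)  # False
```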

Since the XSD specs don't clarify this, the implication seems to be that one
falls back on the XML spec's definition of character references: that a
character reference would be the way to specify any additional characters
within a regular expression.
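
A small sketch of why a character reference works at all: the XML parser
expands it before any regex engine sees the pattern. The element below is a
hypothetical, namespace-free stand-in for an xs:pattern facet, and Python's
xml.etree.ElementTree is used only as a convenient XML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for <xs:pattern value="..."/> in a schema document.
schema_fragment = '<pattern value="[&#x41;-&#x5A;]+"/>'

# The parser resolves &#x41; and &#x5A; while reading the attribute value.
value = ET.fromstring(schema_fragment).get("value")

print(value)  # [A-Z]+
```

Because the expansion happens first, a reference to a regex metacharacter
(for example &#x5B;) reaches the regex engine as that bare metacharacter and
would presumably still need a backslash escape in front of it.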

It might be nice to offer some form of clarification about this in the XSD
drafts/specs. 

More info...

To try to clarify this for myself, I searched within the XSD and XML drafts and
found the following:

... Note: The notation #xA used here (and elsewhere in this specification)
represents the Universal Character Set (UCS) code point hexadecimal A (line
feed), which is denoted by U+000A.  This notation is to be distinguished from
&#xA;, which is the XML character reference to that same UCS code point. ...

After much consideration, I believe the two are being distinguished solely
because the #xNN format is purely EBNF for documentation purposes. This latter
"NOTE" also seems to imply that character references are the only other
form of escape one can use within a regular expression beyond those defined by
the XSD regex docs.

In the same drafts, I see the following:

... This specification makes use of the EBNF notation used in the [XML]
specification. Note that some constructs of the EBNF notation used here
resemble the regular-expression syntax defined in this specification (Regular
Expressions (§G)), but that they are not identical: there are differences. For
a fuller description of the EBNF notation, see Section 6. Notation of the [XML]
specification. ...

This clarification is confusing for someone trying to understand XSD regular
expression specifics because the specification referred to above defines those
regular expressions using EBNF. The regular expression syntax is called out
using EBNF, yet EBNF itself is not exactly what the author of an XSD regular
expression should use. Perhaps this is not generally bad, but it's confusing
for someone like me who is trying to clarify what UCS code point escaping
options are available for use within an XSD regex. Once again, per above, the
heavy implication is that, if it isn't in the XSD draft, the XML spec is the
fallback: whatever it allows is what one can use (and expect tools to
support).

This issue arose because I was receiving an error from software which did not
like an XML configuration file I had created. It was well-formed so I wanted to
verify its format against an XSD. The xmllint tool reported a failure, caused
by \xNN hex escapes used within a regex pattern in the XSD. However, other
XSD/XML validation software accepted the \xNN. That led me to want to know who
was "right". (I realize these are drafts and implementations can differ, but I
was wondering whether I was missing some addendum/errata or something right in
the drafts themselves.)
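
As a possible workaround while this is unclear, the \xNN escapes could be
rewritten as XML character references before the pattern goes into the schema.
The sketch below is only an illustration: hex_escapes_to_char_refs is a
hypothetical helper name, and it assumes the pattern will be placed in an XML
attribute value and that each NN is a character XML allows (so not \x00, for
instance).

```python
import re

# A backslash followed by either "xNN" or any other single character.
_ESCAPE = re.compile(r"\\(x[0-9A-Fa-f]{2}|.)", re.DOTALL)

# Characters the XSD regex grammar lists as needing a backslash escape.
_XSD_META = set("\\|.?*+(){}-[]^")


def hex_escapes_to_char_refs(pattern: str) -> str:
    """Rewrite \\xNN escapes as &#xNN; references (hypothetical helper)."""

    def repl(match: re.Match) -> str:
        body = match.group(1)
        if len(body) != 3:
            # Some other escape such as \d or \\ : leave it untouched.
            return match.group(0)
        hex_digits = body[1:].upper()
        ref = "&#x" + hex_digits + ";"
        # The reference expands to a bare character before the regex is
        # parsed, so keep a backslash in front of regex metacharacters.
        if chr(int(hex_digits, 16)) in _XSD_META:
            return "\\" + ref
        return ref

    return _ESCAPE.sub(repl, pattern)


print(hex_escapes_to_char_refs(r"[\x41-\x5A]+"))  # [&#x41;-&#x5A;]+
```

Matching every backslash escape, rather than searching for \x alone, keeps an
escaped backslash followed by a literal "x41" from being rewritten by mistake.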

Some clarification here may be nice to have... if I missed something pls excuse
in advance. Thanks.

Received on Sunday, 23 February 2014 14:48:30 UTC