Question about metacharacters, regex rule 10, 24 (Datatypes appendix F)

From: C. M. Sperberg-McQueen <cmsmcq@acm.org>
Date: Thu, 10 Jul 2003 15:22:26 -0600
To: W3C XML Schema Comments list <www-xml-schema-comments@w3.org>

Appendix F of Part 2 of XML Schema 1.0 defines 'metacharacter' thus:

   A metacharacter is either ., \, ?, *, +, {, } (, ), [ or ].

It defines 'normal character' thus:

   [Definition:] A normal character is any XML character that is not a
   metacharacter. In regular expressions, a normal character is an
   atom that denotes the singleton set of strings containing only
   itself.

Production [10], which I take to be defining normal characters, reads:

   Normal Character
   [10]  Char ::= [^.\?*+()|#x5B#x5D]

The metacharacters all need escapes, so production [24] is also relevant:

   Single Character Escape
   [24] SingleCharEsc ::= '\' [nrt\|.?*+(){}#x2D#x5B#x5D#x5E]

I have some questions:

(1) shouldn't { and } (braces) be included in production [10]?

   ? [10] Char ::= [^.\?*+{}()|#x5B#x5D]

(2) shouldn't | (vertical bar) be among the characters defined as
metacharacters?

(3) should ^ (#x5E) be included among the metacharacters?

(4) would it be possible to list the magic characters in the same
order in 10 and 24, to make eyeball-based comparisons easier?
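
As a minimal sketch of the comparison asked for in (4), the three
character lists can be transcribed into Python sets and diffed. The
variable names and the show() helper are illustrative only, and the
sets are transcribed by hand from the text quoted in this message,
not parsed from the spec itself.

```python
# Character lists transcribed by hand from the passages quoted above.
# #x5B is '[', #x5D is ']', #x2D is '-', #x5E is '^'.
definition = set(".\\?*+{}()[]")             # prose definition of 'metacharacter'
excluded_by_10 = set(".\\?*+()|[]")          # negated class in production [10]
escapable_by_24 = set("nrt\\|.?*+(){}-[]^")  # character class in production [24]


def show(label, chars):
    """Print a label followed by the characters in sorted order."""
    print(f"{label}: {' '.join(sorted(chars))}")


show("in definition, not excluded by [10]", definition - excluded_by_10)
show("excluded by [10], not in definition", excluded_by_10 - definition)
show("escapable by [24], not in definition", escapable_by_24 - definition)
show("in definition, not escapable by [24]", definition - escapable_by_24)
```

The first two differences correspond to questions (1) and (2); the
third lists the characters that have escapes without being called
metacharacters, which includes the circumflex of question (3).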

I suspect the answer to (2) is 'yes' and the answer to (3) is 'no', on
the theory that the term 'metacharacter' is best reserved for
characters which have special meaning at the top level of a regular
expression and which must therefore have escapes to avoid ambiguity.
Hyphen, circumflex, comma, n, r, and t all have special meaning only
in special contexts (within character groups, within quantity-range
specifications, or after backslash), and so aren't metacharacters in
this sense.

But I may be wrong.

Received on Thursday, 10 July 2003 17:22:37 UTC
