[Bug 2216] R-224: Questions about metacharacters in regular expressions

http://www.w3.org/Bugs/Public/show_bug.cgi?id=2216

------- Comment #5 from davep@iit.edu  2008-05-18 05:32 -------
(In reply to comment #0)
>> Appendix F in the Part 2 of XML Schema 1.0 defines 'metacharacter' thus: 
> 
> A metacharacter is either ., \, ?, *, +, {, } (, ), [ or ]. 
> 
> It defines 'normal character' thus: 
> 
> [Definition:] A normal character is any XML character that is not a 
> metacharacter. In regular expressions, a normal character is an atom that 
> denotes the singleton set of strings containing only itself. 
> 
> Production [10], which I take to be defining normal characters, reads: 
> 
> Normal Character [10] Char ::= [^.\?*+()|#x5B#x5D] 
> 
> The metacharacters all need escapes, so production 24 is also relevant here: 
> 
> Single Character Escape [24] SingleCharEsc ::= '\' [nrt\|.?*+(){}#x2D#x5B#x5D#x5E] 
> 
> I have some questions: 

> 3. should ^ (#x5E) be included among the metacharacters? 

> I suspect...the answer to (3) is 'no, on the 
> theory that the term 'metacharacter' is best reserved for characters which have 
> special meaning at the top level of a regular expression and which must 
> therefore have escapes to avoid ambiguity. Hyphen, circumflex, comma, n, r, and 
> t all have special meaning only in special contexts (within character groups, 
> within quantity-range specifications, or after backslash), and so aren't 
> metacharacters in this sense. 

Let me define characters used autonymously (self-naming) as those which act as
single-character classes containing themselves, and metacharacters as those
which are not being used autonymously, with the understanding that the same
character in different occurrences in an RE can be one or the other.  I'll call
the characters selected by the "metacharacter" nonterminal "top-level
metacharacters" or "TLMs"; here "top-level" means "outside of a character
class expression".

At top level, many of the TLMs can occur where other characters can occur
autonymously; in those locations the TLM has to be escaped to have
autonymous effect.  There are other top-level places where a TLM cannot be a
legal metacharacter and could presumably be used autonymously.  But the
designers of the language apparently didn't want users to have to wonder,
so they made escaping both possible and required for every TLM.  (For
that matter, a few TLMs cannot be used as metacharacters in a location where an
autonymous character can occur, but that's the language design.)
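
To make the "always escape a TLM at top level" rule concrete, here is a minimal
Python sketch.  It is mine, not from the spec: the TLM set is assumed to be the
union of the quoted prose definition and production [10] (so '|' is included),
and the helper name escape_top_level is hypothetical.

```python
# Top-level metacharacters (TLMs).  Assumption: the union of the quoted prose
# definition and production [10], which adds '|'.
TLMS = set(".\\?*+{}()|[]")


def escape_top_level(literal: str) -> str:
    """Backslash-escape every TLM so the text is used autonymously.

    Hypothetical helper, for illustration only; it covers top-level use,
    not the different rules inside character class expressions.
    """
    return "".join("\\" + ch if ch in TLMS else ch for ch in literal)


print(escape_top_level("a.b*(c)"))  # a\.b\*\(c\)
```

Note that '^' and '-' pass through unchanged, since neither is a TLM.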

Within character class expressions, only a few TLMs can be used as
metacharacters; '^' (which is not a TLM) can also be so used.  The autonymous
vs meta rules are different here: there is no blanket prohibition of potential
metacharacters being used autonymously; rather, there are some rules specifying
where they can and can't be so used.  (A few TLMs still can never be
autonymous; those that can't be metacharacters here can always be autonymous;
and for '-' and '^' the rules allow each at different places.)  But since '^'
can't be used as a metacharacter at top level, it is not in the TLM list.

All the TLMs and '^' are *permitted* to be escaped if their autonymous use is
wanted; this is so that a user who is not sure whether a character can be meta
at a given location, and who wants autonymous usage, can just escape it and be
sure to get the effect they want.  That's why '^' is in the
single-character-escape list.
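
The relationship between the two lists can be checked mechanically.  The
following Python sketch is only an illustration under stated assumptions: the
escapable set is copied from the quoted production [24], the TLM set is again
assumed to be the union of the prose definition and production [10], and the
function name is hypothetical.

```python
# Characters that may follow a backslash, per the quoted production [24].
SINGLE_CHAR_ESC = set("nrt\\|.?*+(){}-[]^")

# Top-level metacharacters.  Assumption: prose definition plus production [10].
TLMS = set(".\\?*+{}()|[]")


def escapable_but_not_tlm() -> list:
    """Punctuation that can be escaped although it is not a TLM.

    n, r and t are left out: their escapes denote other characters
    (newline, return, tab), not autonymous use.
    """
    return sorted(SINGLE_CHAR_ESC - TLMS - set("nrt"))


print(escapable_but_not_tlm())  # ['-', '^']
print(TLMS <= SINGLE_CHAR_ESC)  # True: every TLM has an escape
```

Under those assumptions, '-' and '^' are exactly the characters that are
special only inside character class expressions yet can still be escaped.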

Are we having fun yet?  ;-)

Received on Sunday, 18 May 2008 05:32:59 UTC