- From: <bugzilla@farnsworth.w3.org>
- Date: Sun, 18 May 2008 05:32:21 +0000
- To: www-xml-schema-comments@w3.org
- CC:
http://www.w3.org/Bugs/Public/show_bug.cgi?id=2216

------- Comment #5 from davep@iit.edu 2008-05-18 05:32 -------
(In reply to comment #0)
> Appendix F in the Part 2 of XML Schema 1.0 defines 'metacharacter' thus:
>
>   A metacharacter is either ., \, ?, *, +, {, } (, ), [ or ].
>
> It defines 'normal character' thus:
>
>   [Definition:] A normal character is any XML character that is not a
>   metacharacter. In regular expressions, a normal character is an atom that
>   denotes the singleton set of strings containing only itself.
>
> Production [10], which I take to be defining normal characters, reads:
>
>   Normal Character [10] Char ::= [^.\?*+()|#x5B#x5D]
>
> The metacharacters all need escapes, so production 24 is also relevant here:
>
>   Single Character Escape [24] SingleCharEsc ::= '\' [nrt\|.?*+(){}#x2D#x5B#x5D#x5E]
>
> I have some questions:

> 3. should ^ (#x5E) be included among the metacharacters?

> I suspect...the answer to (3) is 'no', on the
> theory that the term 'metacharacter' is best reserved for characters which have
> special meaning at the top level of a regular expression and which must
> therefore have escapes to avoid ambiguity. Hyphen, circumflex, comma, n, r, and
> t all have special meaning only in special contexts (within character groups,
> within quantity-range specifications, or after backslash), and so aren't
> metacharacters in this sense.

Let me define characters used autonymously (self-naming) as those which act as single-character classes containing themselves, and metacharacters as those which are not being used autonymously, with the understanding that the same character in different occurrences in an RE can be one or the other.

I'll call the characters selected by the "metacharacter" nonterminal "top-level metacharacters" or "TLMs". "Top-level" here means "outside of a character class expression".
At the top level, many of the TLMs can occur where other characters can occur autonymously; in those locations a TLM has to be escaped to have autonymous effect. There are other top-level places where a TLM cannot be a legal metacharacter and could presumably be used autonymously. But the designers of the language apparently didn't want users to have to wonder, so they made escaping the TLMs both possible and required everywhere at the top level. (For that matter, a few TLMs cannot be used as metacharacters in a location where an autonymous character can occur, but that's the language design.)

Within character class expressions, only a few TLMs can be used as metacharacters; '^' (which is not a TLM) can also be so used. The autonymous-vs-meta rules are different here: there is no blanket prohibition on potential metacharacters being used autonymously; rather, there are rules specifying where they can and can't be so used. (A few TLMs still can never be autonymous; those that can't be metacharacters here can always be autonymous; and for '-' and '^' the rules allow each at different places.) But since '^' can't be used as a metacharacter at the top level, it is not in the TLM list.

All the TLMs and '^' are *permitted* to be escaped if their autonymous use is wanted. This is so that a user who is not sure whether a character can be meta at a given location, and who wants autonymous usage, can just escape it and be sure to get the intended effect. That's why '^' is in the single-character-escape list.

Are we having fun yet? ;-)
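
To make the "just escape it and be sure" point concrete, here is a minimal Python sketch. It is not taken from the spec or from any schema processor; the names PROSE_METACHARS, SELF_ESCAPABLE, escape_literal and escapable_but_not_listed are hypothetical, and the two character sets are simply transcribed from the prose definition and from production [24] as quoted above.

```python
# Illustrative sketch only; all names here are hypothetical.

# Characters in the prose definition of "metacharacter" quoted above.
PROSE_METACHARS = frozenset(".\\?*+{}()[]")

# Characters that production [24] allows after a backslash to denote
# themselves. n, r and t are omitted: escaped, they denote other characters.
SELF_ESCAPABLE = frozenset("\\|.?*+(){}-[]^")


def escape_literal(text: str) -> str:
    """Backslash-escape every character that production [24] permits,
    whether or not it would be 'meta' at that particular position."""
    return "".join("\\" + ch if ch in SELF_ESCAPABLE else ch for ch in text)


def escapable_but_not_listed() -> set:
    """Characters that may be escaped but are absent from the prose list."""
    return set(SELF_ESCAPABLE - PROSE_METACHARS)


if __name__ == "__main__":
    print(escape_literal("a^b-c"))
    print(sorted(escapable_but_not_listed()))
```

The second function shows the gap under discussion: '^', '-' and '|' are escapable under production [24] yet do not appear in the prose list of metacharacters.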
Received on Sunday, 18 May 2008 05:32:59 UTC