
Re: [ACTION-385] Common regular expression syntax

From: Shaun McCance <shaunm@gnome.org>
Date: Mon, 04 Feb 2013 13:52:48 -0500
To: public-multilingualweb-lt@w3.org
Message-ID: <1360003968.2220.345.camel@recto>
On Sun, 2013-01-27 at 12:30 -0500, Shaun McCance wrote:
> So what I think this leaves us with is character classes
> [abc], ranges [a-c], and negations [^abc], where "^" and
> "]" must never appear unless backslash-escaped, "-" may
> be backslash-escaped or put at the beginning or end, the
> escape sequences "\n", "\r", "\t", "\d", and "\D" may be
> used, and literal "\" is escaped as "\\".
> Importantly, you must never have an unescaped backslash,
> because some dialects may treat it as the beginning of
> an escape sequence that means something special.
> This is a very limited subset, but I think it's what we
> have to use. I'm now going to try to make a portable RE
> that matches these portable RE character classes.
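
The quoted rules can be sketched as a small hand-written checker. This
is only an illustration in Python, not the regular expression proposed
in this thread. The name is_portable_class and its allow_digit_escapes
flag are hypothetical. The sketch also makes assumptions the rules do
not state: "[" is treated as an ordinary character, a reversed range
such as [c-a] is rejected, and a chained form such as [a-c-e] is
rejected as ambiguous.

```python
# Hypothetical helper: checks one bracketed class against the quoted rules.
SIMPLE_ESCAPES = {"n": "\n", "r": "\r", "t": "\t", "\\": "\\",
                  "^": "^", "]": "]", "-": "-"}
DIGIT_ESCAPES = {"d", "D"}


def is_portable_class(text, allow_digit_escapes=True):
    """Return True if text is a single character class in the subset."""
    if len(text) < 3 or text[0] != "[" or text[-1] != "]":
        return False
    body = text[1:-1]
    if body.startswith("^"):        # a leading "^" negates the class
        body = body[1:]
    if not body:
        return False

    # Pass 1: tokenize into ("char", c), ("set", None) or ("dash", None).
    tokens = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            if i + 1 >= len(body):
                return False        # dangling backslash
            nxt = body[i + 1]
            if nxt in SIMPLE_ESCAPES:
                tokens.append(("char", SIMPLE_ESCAPES[nxt]))
            elif nxt in DIGIT_ESCAPES and allow_digit_escapes:
                tokens.append(("set", None))
            else:
                return False        # any other escape may be dialect-specific
            i += 2
        elif ch in "^]":
            return False            # these must be backslash-escaped
        elif ch == "-":
            tokens.append(("dash", None))
            i += 1
        else:
            tokens.append(("char", ch))
            i += 1

    # Pass 2: an unescaped "-" is either a range operator between two
    # single characters or a literal at the very beginning or end.
    j, n = 0, len(tokens)
    while j < n:
        kind, value = tokens[j]
        if kind == "dash":
            if j not in (0, n - 1):
                return False
            j += 1
        elif (kind == "char" and j + 2 < n
              and tokens[j + 1][0] == "dash" and tokens[j + 2][0] == "char"):
            if value > tokens[j + 2][1]:
                return False        # reversed range (assumption, see above)
            j += 3
        else:
            j += 1
    return True
```

Passing allow_digit_escapes=False gives the stricter variant for engines
where "\d" and "\D" cannot be relied on.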

Upon further investigation, it seems some engines let \d match Unicode
digits outside 0-9, so that's out too. There's an open
question of what characters can be referred to. I decided to use
the definition of Char in XML 1.0:


It's hard to reference these, because many of the range boundary
characters are unassigned, so effectively unprintable. I think
we don't want to embed the literal character U+D7FF in the spec.
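
For reference, the Char production in XML 1.0 is
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF].
A minimal Python sketch of a membership test for that set follows. The
names XML_CHAR_RANGES and is_xml_char are hypothetical and only serve
the illustration. Writing the boundaries as code point numbers avoids
having to type characters such as U+D7FF literally.

```python
# Inclusive code point ranges of the XML 1.0 Char production.
XML_CHAR_RANGES = (
    (0x9, 0x9),
    (0xA, 0xA),
    (0xD, 0xD),
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
)


def is_xml_char(ch):
    """Return True if the single character ch is an XML 1.0 Char."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in XML_CHAR_RANGES)
```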

Here is the proposed regular expression escaped with XML numeric
character entities, as if it were put into an XML document:


(Email will almost certainly add line breaks. Ignore them.)

There are two ways I know of to escape characters (not bytes) in
different engines: \x{2234} and \u2234. The \u syntax takes exactly
four hex digits, so it can only reference characters in the Basic
Multilingual Plane (Plane 0, U+0000 to U+FFFF). It works in
everything except XSD and Perl/PCRE. The \x{} syntax is only
Perl/PCRE, but can specify any character.
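
The dialect split is easy to observe in one engine. The sketch below
uses Python's re module (3.3 or later) purely as an example of an engine
that takes \u but not \x{}. The helper name compiles is hypothetical,
and other engines will behave differently.

```python
import re


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# \u with four hex digits is understood inside a character class.
u_form_ok = compiles(r"[\u2234]")
u_form_matches = re.fullmatch(r"[\u2234]", "\u2234") is not None

# The Perl/PCRE brace form is rejected by this engine.
brace_form_ok = compiles(r"[\x{2234}]")
```

With a current CPython, u_form_ok and u_form_matches should be True and
brace_form_ok should be False.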

Here it is with \x{}, for Perl/PCRE only:


And here is a regular expression that matches a subset of our
subset, limited to the Basic Multilingual Plane, with the \u escape:


And remember, the backslashes and escaped backslashes are significant
to the regular expression engine. If you're putting that into a string
in a language like Java or C#, you need to escape the escapes:

re = new Regex("^(\\.|\\[^?-?(([\\u0009\\u000A\\u000D\\u0020-\\u002C\
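
The same doubling can be checked in Python, where a raw string plays the
role of "what the regular expression engine should see". This is a small
sketch with a made-up pattern (a literal backslash followed by one
character in A-Z), not the expression proposed above.

```python
import re

# What the engine should see: \\ (a literal backslash), then [\u0041-\u005A].
raw = r"\\[\u0041-\u005A]"

# The same text in an ordinary string literal: every backslash is doubled.
doubled = "\\\\[\\u0041-\\u005A]"
same_text = raw == doubled

# Escaping only once changes the pattern: the string literal consumes the
# escapes first, so the engine sees \[A-Z] and the class is gone.
under_escaped = "\\[\u0041-\u005A]"

subject = "\\Q"  # a backslash followed by Q
raw_matches = re.fullmatch(raw, subject) is not None
under_escaped_matches = re.fullmatch(under_escaped, subject) is not None
```

Here same_text and raw_matches are True, while under_escaped_matches is
False because the under-escaped form only matches the literal text [A-Z].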

Received on Monday, 4 February 2013 18:53:11 UTC
