[ACTION-385] Common regular expression syntax from Shaun McCance on 2013-01-27 (public-multilingualweb-lt@w3.org from January 2013)

From: Shaun McCance <shaunm@gnome.org>
Date: Sun, 27 Jan 2013 12:30:07 -0500
To: public-multilingualweb-lt@w3.org
Message-ID: <1359307807.2220.248.camel@recto>

I've investigated features in six different regular expression
dialects to try to find a safe common subset for the allowed
characters data category. I tested Java, .Net, XSD, JavaScript,
Perl, and Python. I still want to test POSIX EREs, and PHP may
be good to test as well, given the focus on CMSs in 2.0. But I
think the subset from the six I tested is going to be safe in
general.

I only tested RE features found inside character classes, i.e.
stuff between '[' and ']'. Everything was tested on Fedora 14:
Java with OpenJDK 1.6.0, .Net with Mono 2.6.7, JavaScript with
Firefox 3.6.12, XSD with libxml2 2.7.7, Perl 5.12.2, and Python
2.7.

Notes: For Perl I had to use "use utf8;" to get anything beyond
ASCII to work right. For Python, I had to use Unicode objects
with the u'' notation to get non-ASCII right. Python 3 probably
does this better, but I didn't test it.

On with it:


Unicode Classes (\p{L}, etc) are supported by everything but
JavaScript. Perl and Java recognize some shorthand classes,
but I believe there's a base set that's compatible, except
for JavaScript.

POSIX character classes ([:digit:], etc) are only supported
by Perl and POSIX tools like grep. They're out.

Escaping "^": In all dialects, you can escape "^" with "\".
In everything but XSD, you can just put "^" somewhere other
than the beginning of the class. This sounds insane to me.

Escaping "]": In all dialects, you can escape "]" with "\".
In everything but XSD and JavaScript, you can use it as the
first character.
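
As a rough illustration of the two placements described
above, here is a minimal sketch under Python 3 (which was
not among the tested versions). The helper name
class_matches is made up for this example.

```python
import re

def class_matches(pattern, char):
    """Return True if the character class in `pattern` matches `char`."""
    return re.fullmatch(pattern, char) is not None

# "^" is literal when backslash-escaped or when it is not first.
caret_escaped = class_matches(r'[\^a]', '^')
caret_not_first = class_matches(r'[a^]', '^')

# "]" is literal when backslash-escaped or when it is first.
bracket_escaped = class_matches(r'[\]a]', ']')
bracket_first = class_matches(r'[]a]', ']')

print(caret_escaped, caret_not_first, bracket_escaped, bracket_first)
```

Per the findings above, only the backslash-escaped forms
should be expected to carry over to every dialect.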

\Q...\E expressions are only supported in Java and Perl.

In all dialects, "-" can be escaped with a "\" or by putting
it at the beginning or end of the character class.
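
A minimal sketch of the three spellings, assuming Python 3's
re module; the names HYPHEN_CLASSES and hyphen_is_literal
are made up for this example. The last line shows the
contrast case, where an unescaped "-" in the middle forms a
range.

```python
import re

# "-" placed first, placed last, and backslash-escaped in the middle.
HYPHEN_CLASSES = [r'[-ab]', r'[ab-]', r'[a\-b]']

def hyphen_is_literal(pattern):
    """True if `pattern` matches "-" itself, i.e. no range was formed."""
    return re.fullmatch(pattern, '-') is not None

results = {p: hyphen_is_literal(p) for p in HYPHEN_CLASSES}

# Unescaped and in the middle, "-" builds the range U+002B..U+002F,
# so the class also matches "." (U+002E).
range_formed = re.fullmatch(r'[+-/]', '.') is not None
```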

Character class subtractions (e.g. [a-z-[aeiou]]) are only
supported in XSD. I read that .Net supports them, but that
didn't pan out in my tests. It could be a newer addition to
the standard, or it could be that Mono is buggy (rare).

Octal escapes (e.g. \135 for "]") are supported in Python,
Perl, and JavaScript. Hex escapes (e.g. \x5D for "]") are
supported in all but XSD. Unicode escapes (e.g. \u2234 for
"∴") are supported in all but XSD and Perl. This one makes
me sad. I really wish we could use Unicode escapes safely.
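
For reference, a small sketch of the three escape styles
under Python 3's re module (an assumption: the tests above
used Python 2.7 with u'' literals, not Python 3). Raw
strings are used so that the regex engine, not the string
literal, decodes each escape.

```python
import re

# Octal and hex escapes for "]" (U+005D), decoded by the re module.
octal_ok = re.fullmatch(r'[\135]', ']') is not None
hex_ok = re.fullmatch(r'[\x5D]', ']') is not None

# Unicode escape for U+2234 THEREFORE, also decoded by the re module.
unicode_ok = re.fullmatch(r'[\u2234]', '\u2234') is not None

print(octal_ok, hex_ok, unicode_ok)
```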

Everything supports "\n", "\r", and "\t".

XSD and JavaScript don't support "\a" for U+0007 BELL. XSD,
JavaScript, and Python don't support "\e" for U+001B ESCAPE.
Not big losses. XSD is the only dialect that doesn't support
"\f" for U+000C FORM FEED.

Java, .Net, JavaScript, and Python support "\v" to match
U+000B LINE TABULATION only. In Perl, "\v" matches anything
it calls vertical whitespace, U+000A through U+000D. True
to fashion, XSD doesn't support "\v" at all.

Control code escapes: \cA through \cZ mean U+0001 through
U+001A in Java, .Net, JavaScript, and Perl. In XSD, the \c
escape means something entirely different. Python doesn't
support \c.
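
One quick way to probe which of these escapes a dialect
accepts is to try compiling each one. The sketch below does
that for Python, assuming Python 3.6 or later, where an
unknown backslash-letter escape is an error. The names
compiles, ESCAPES, and support are made up for this example.

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts `pattern`."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Escapes discussed above, each wrapped in a character class.
ESCAPES = [r'\a', r'\f', r'\v', r'\e', r'\cA']
support = {esc: compiles('[' + esc + ']') for esc in ESCAPES}

for esc, ok in support.items():
    print(esc, 'accepted' if ok else 'rejected')
```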

Every dialect seems to support \d and \D and agree on what
they actually mean.

In XSD and .Net, \w matches lots of Unicode word characters.
In the others, it matches [A-Za-z0-9_], although at least
for Python, the documentation says it's locale-dependent.
See my note on that below. They all support \W, with the
same compatibility problem.
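
To show how much \w can vary even within one language, here
is a sketch assuming Python 3, which was not tested above.
With str patterns, \w is Unicode-aware by default, and the
re.ASCII flag narrows it to [A-Za-z0-9_].

```python
import re

E_ACUTE = '\u00e9'  # LATIN SMALL LETTER E WITH ACUTE

# Default str patterns in Python 3: \w is Unicode-aware.
unicode_w = re.fullmatch(r'[\w]', E_ACUTE) is not None

# With re.ASCII, \w is limited to [A-Za-z0-9_].
ascii_w = re.fullmatch(r'[\w]', E_ACUTE, flags=re.ASCII) is not None

print(unicode_w, ascii_w)
```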

Every dialect supports \s for whitespace, but they all have
different definitions of whitespace. In XSD, \s matches
space, tab, carriage return, line feed. In Java, Perl, and
Python, \s matches those plus vertical tab and form feed.
In JavaScript, \s matches all sorts of Unicode whitespace
characters, like non-breaking spaces and zero-width spaces.
They all support \S, with the same compatibility problem.
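
The same kind of check works for \s. This sketch assumes
Python 3 (not tested above), where \s on str patterns is
Unicode-aware unless re.ASCII is given. The variable names
are made up for this example.

```python
import re

NBSP = '\u00a0'  # NO-BREAK SPACE

# Default str patterns in Python 3: \s covers Unicode whitespace.
unicode_s = re.fullmatch(r'[\s]', NBSP) is not None

# With re.ASCII, \s is limited to space, \t, \n, \r, \f, and \v.
ascii_s = re.fullmatch(r'[\s]', NBSP, flags=re.ASCII) is not None
ascii_vt_ff = all(
    re.fullmatch(r'[\s]', c, flags=re.ASCII) is not None for c in '\v\f'
)

print(unicode_s, ascii_s, ascii_vt_ff)
```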

In some dialects, some behavior changes based on locale.
This is dangerous, and I believe we should avoid all such
behavior. To the extent that it's useful, it should not
be based on the locale the program is running in, but on
the locale you're translating to or (possibly) from. And
for the latter, we'd have to define an interaction with
langPointer.

So what I think this leaves us with is character classes
[abc], ranges [a-c], and negations [^abc], where "^" and
"]" must never appear unless backslash-escaped, "-" may
be backslash-escaped or put at the beginning or end, the
escape sequences "\n", "\r", "\t", "\d", and "\D" may be
used, and literal "\" is escaped as "\\".

Importantly, you must never have a backslash that isn't
part of one of the escapes listed above, because some
dialects may treat it as the beginning of an escape
sequence that means something special.
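
The rules above can be sketched as a checker. This is only
one possible reading, written for Python 3, and not the
portable RE mentioned in this message. All names (LITERAL,
CHAR_ESCAPE, CLASS_ESCAPE, CHAR, ITEM, PORTABLE_CLASS,
is_portable_class) are made up for this example. Two
choices go beyond the rules as stated: an empty class is
rejected, and range endpoints are not checked for order.

```python
import re

# Any single character except the four that need special care.
LITERAL = r'[^\\\]\^\-]'
# Escapes that stand for exactly one character.
CHAR_ESCAPE = r'\\[nrt\\\]\^\-]'
# Escapes that stand for a set of characters.
CLASS_ESCAPE = r'\\[dD]'

CHAR = rf'(?:{LITERAL}|{CHAR_ESCAPE})'
# A class member: a range, a single character, or \d / \D.
ITEM = rf'(?:{CHAR}-{CHAR}|{CHAR}|{CLASS_ESCAPE})'

# "[", optional "^", then members with an optional literal "-"
# at the beginning or end (or a lone "-"), then "]".
PORTABLE_CLASS = re.compile(rf'\[\^?(?:-?{ITEM}+-?|-)\]')

def is_portable_class(text):
    """Return True if `text` is a class within the subset above."""
    return PORTABLE_CLASS.fullmatch(text) is not None

for candidate in ['[abc]', '[^a-c]', r'[\^\]\-\\]', '[a^]', r'[\w]']:
    print(candidate, is_portable_class(candidate))
```

The last two candidates are rejected: "[a^]" has an
unescaped "^", and "[\w]" uses an escape outside the
subset.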

This is a very limited subset, but I think it's what we
have to use. I'm now going to try to make a portable RE
that matches these portable RE character classes.

Comments?

--
Shaun
Received on Sunday, 27 January 2013 17:30:35 UTC