RE: [ACTION-385] Common regular expression syntax from Yves Savourel on 2013-01-27 (public-multilingualweb-lt@w3.org from January 2013)

From: Yves Savourel <ysavourel@enlaso.com>
Date: Sun, 27 Jan 2013 15:31:16 -0700
To: <public-multilingualweb-lt@w3.org>
Message-ID: <assp.0739a73800.assp.0739e90827.017201cdfcde$05ab8b00$1102a100$@com>
Hi Shaun,

Thanks for the thorough analysis.
That should be enough for the goals of the data category.

Cheers,
-yves


-----Original Message-----
From: Shaun McCance [mailto:shaunm@gnome.org] 
Sent: Sunday, January 27, 2013 10:30 AM
To: public-multilingualweb-lt@w3.org
Subject: [ACTION-385] Common regular expression syntax

I've investigated features in six different regular expression dialects to try to find a safe common subset for the allowed characters data category. I tested Java, .Net, XSD, JavaScript, Perl, and Python. I still want to test POSIX EREs, and PHP may be good to test as well, given the focus on CMSs in 2.0. But I think the subset from the six I tested is going to be safe in general.

I only tested RE features found inside character classes, i.e.
stuff between '[' and ']'. Everything was tested on Fedora 14.
Java with OpenJDK 1.6.0. .Net with Mono 2.6.7. JavaScript with Firefox 3.6.12. XSD with libxml2 2.7.7. Perl 5.12.2. Python 2.7.

Notes: For Perl I had to use "use utf8;" to get anything beyond ASCII to work right. For Python, I had to use Unicode objects with the u'' notation to get non-ASCII right. Python 3 probably does this better, but I didn't test it.

On with it:


Unicode Classes (\p{L}, etc) are supported by everything but JavaScript. Perl and Java recognize some shorthand classes, but I believe there's a base set that's compatible, except for JavaScript.

POSIX character classes ([:digit:], etc) are only supported by Perl and POSIX tools like grep. They're out.

Escaping "^": In all dialects, you can escape "^" with "\".
In everything but XSD, you can just put "^" somewhere other than the beginning of the class. This sounds insane to me.

Escaping "]": In all dialects, you can escape "]" with "\".
In everything but XSD and JavaScript, you can use it as the first character.

\Q...\E expressions are only supported in Java and Perl.

In all dialects, "-" can be escaped with a "\" or by putting it at the beginning or end of the character class.
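
As a small illustration of the three escaping rules above, here is a sketch in Python 3 (an assumption on my part: the tests in this message used Python 2.7, and the variable names are made up for the example). It only shows Python's behaviour and says nothing about the other dialects.

```python
import re

# Backslash-escaped "^", "]", and "-" inside a character class.
escaped = re.compile(r'[\^\]\-]')

# "-" placed first or last in the class is also taken literally.
hyphen_first = re.compile(r'[-ab]')
hyphen_last = re.compile(r'[ab-]')

print(escaped.findall('a^b]c-d'))
print(hyphen_first.findall('a-b'))
print(hyphen_last.findall('a-b'))
```

The backslash forms are the ones every tested dialect accepted, which is why they are the safer spelling.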

Character class subtraction (e.g. [a-z-[aeiou]]) is only supported in XSD. I read that .Net supports it, but that didn't pan out in my tests. It could be a newer addition to the standard, or it could be that Mono is buggy (rare).

Octal escapes (e.g. \135 for "]") are supported in Python, Perl, and JavaScript. Hex escapes (e.g. \x5D for "]") are supported in all but XSD. Unicode escapes (e.g. \u2234 for
"∴") are supported in all but XSD and Perl. This one makes me sad. I really wish we could use Unicode escapes safely.
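
To make the escape forms concrete, the following sketch writes the same one-character class several ways in Python 3 (assumed here; it was not one of the tested versions, and the variable names are illustrative only). Only the backslash form is spelled the same way in all six dialects per the findings above.

```python
import re

# Three spellings of a class containing only "]", plus a Unicode escape.
octal_class = re.compile(r'[\135]')
hex_class = re.compile(r'[\x5D]')
backslash_class = re.compile(r'[\]]')
unicode_class = re.compile(r'[\u2234]')

for pattern in (octal_class, hex_class, backslash_class):
    print(pattern.pattern, bool(pattern.fullmatch(']')))

print(unicode_class.pattern, bool(unicode_class.fullmatch('\u2234')))
```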

Everything supports "\n", "\r", and "\t".

XSD and JavaScript don't support "\a" for U+0007 BELL. XSD, JavaScript, and Python don't support "\e" for U+001B ESCAPE.
Not big losses. XSD is the only dialect that doesn't support "\f" for U+000C FORM FEED.

Java, .Net, JavaScript, and Python support "\v" to match
U+000B LINE TABULATION only. In Perl, "\v" matches anything
it calls vertical whitespace, U+000A through U+000D. True to fashion, XSD doesn't support "\v" at all.

Control code escapes: \cA through \cZ means U+0001 through
U+001A in Java, .Net, JavaScript, and Perl. In XSD, the \c
escape means something entirely different. Python doesn't support \c.

Every dialect seems to support \d and \D and agree on what they actually mean.

In XSD and .Net, \w matches lots of Unicode word characters.
In the others, it matches [A-Za-z0-9_]. Although at least for Python, the documentation says it's locale-dependent.
See my note on that below. They all support \W, with the same compatibility problem.
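
The \w divergence can be reproduced inside a single implementation. This sketch assumes Python 3, where str patterns are Unicode-aware by default and the re.ASCII flag restricts \w to [A-Za-z0-9_]; that differs from the Python 2.7 behaviour described above, and the sample text is made up.

```python
import re

text = 'naïve café_1'

# Default (Unicode-aware) matching for str patterns in Python 3.
unicode_words = re.findall(r'\w+', text)

# re.ASCII restricts \w to [A-Za-z0-9_].
ascii_words = re.findall(r'\w+', text, re.ASCII)

print(unicode_words)
print(ascii_words)
```

The same pattern gives two different answers depending on a flag, which is the kind of ambiguity that argues for leaving \w out of a portable subset.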

Every dialect supports \s for whitespace, but they all have different definitions of whitespace. In XSD, \s matches space, tab, carriage return, line feed. In Java, Perl, and Python, \s matches those plus vertical tab and form feed.
In JavaScript, \s matches all sorts of Unicode whitespace characters, like non-breaking spaces and zero-width spaces.
They all support \S, with the same compatibility problem.
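
One way around the \s disagreement is to spell out the intended characters. The sketch below (Python 3 assumed, names illustrative) compares \s with an explicit class holding the four characters XSD's \s is described as matching above.

```python
import re

# Space, tab, CR, LF, vertical tab, form feed, no-break space.
samples = [' ', '\t', '\r', '\n', '\x0b', '\x0c', '\u00a0']

shorthand = re.compile(r'\s')
explicit = re.compile(r'[ \t\r\n]')

shorthand_hits = [ch for ch in samples if shorthand.fullmatch(ch)]
explicit_hits = [ch for ch in samples if explicit.fullmatch(ch)]

print([hex(ord(ch)) for ch in shorthand_hits])
print([hex(ord(ch)) for ch in explicit_hits])
```

The explicit class uses only literals plus "\t", "\r", and "\n", so its meaning does not depend on how a given engine defines whitespace.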

In some dialects, some behavior changes based on locale.
This is dangerous, and I believe we should avoid all such behavior. To the extent that it's useful, it should not be based on the locale the program is running in, but on the locale you're translating to or (possibly) from. And for the latter, we'd have to define an interaction with langPointer.

So what I think this leaves us with is character classes [abc], ranges [a-c], and negations [^abc], where "^" and "]" must never appear unless backslash-escaped, "-" may be backslash-escaped or put at the beginning or end, the escape sequences "\n", "\r", "\t", "\d", and "\D" may be used, and literal "\" is escaped as "\\".

Importantly, you must never have an unescaped backslash, because some dialects may treat it as the beginning of an escape sequence that means something special.
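
A rough checker for these rules could look like the sketch below. It is a literal reading of the two paragraphs above, written in Python 3 for convenience; it is not a portable RE itself (it relies on non-capturing groups), and the names CHAR, ITEM, PORTABLE_CLASS, and is_portable_class are hypothetical. It additionally assumes a class must not be empty and that range endpoints are single characters.

```python
import re

# One character: any literal except \ ^ ] - , or an allowed escape
# (\n \r \t \\ \^ \] \-).
CHAR = r'(?:[^\\\^\]\-]|\\[nrt\\\^\]\-])'

# One item: \d or \D, a range such as a-c, or a single character.
ITEM = rf'(?:\\[dD]|{CHAR}-{CHAR}|{CHAR})'

# "[", optional "^" negation, then either a lone "-" or one or more
# items with an optional literal "-" at the very start or end, then "]".
PORTABLE_CLASS = re.compile(rf'\[\^?(?:-|-?{ITEM}+-?)\]')


def is_portable_class(expr):
    """Return True if expr is a character class within the subset."""
    return PORTABLE_CLASS.fullmatch(expr) is not None


for candidate in (r'[abc]', r'[a-c]', r'[^abc]', r'[\^\]\-\\]',
                  r'[-a-z]', r'[a^b]', r'[\w]', r'[\x5D]', r'[]abc]'):
    print(candidate, is_portable_class(candidate))
```

Anything with an unlisted backslash sequence, such as \w or \x5D, is rejected outright, which matches the rule that an unescaped backslash must never appear.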

This is a very limited subset, but I think it's what we have to use. I'm now going to try to make a portable RE that matches these portable RE character classes.

Comments?

--
Shaun
Received on Sunday, 27 January 2013 22:31:44 UTC