Re: [ACTION-385] Common regular expression syntax from Jörg Schütz on 2013-01-28 (public-multilingualweb-lt@w3.org from January 2013)

From: Jörg Schütz <joerg@bioloom.de>
Date: Mon, 28 Jan 2013 08:32:11 +0100
To: public-multilingualweb-lt@w3.org
Message-ID: <5106297B.9040409@bioloom.de>

Hi Shaun,

Very good analysis! The identified subset should be fairly sufficient 
for the ITS 2.0 regex needs although it is quite limited...

Cheers -- Jörg

On Jan 27, 2013, at 18:30 (CET), Shaun McCance wrote:
> I've investigated features in six different regular expression
> dialects to try to find a safe common subset for the allowed
> characters data category. I tested Java, .Net, XSD, JavaScript,
> Perl, and Python. I still want to test POSIX EREs, and PHP may
> be good to test as well, given the focus on CMSs in 2.0. But I
> think the subset from the six I tested is going to be safe in
> general.
>
> I only tested RE features found inside character classes, i.e.
> stuff between '[' and ']'. Everything was tested on Fedora 14.
> Java with OpenJDK 1.6.0. .Net with Mono 2.6.7. JavaScript with
> Firefox 3.6.12. XSD with libxml2 2.7.7. Perl 5.12.2. Python 2.7.
>
> Notes: For Perl I had to use "use utf8;" to get anything beyond
> ASCII to work right. For Python, I had to use Unicode objects
> with the u'' notation to get non-ASCII right. Python 3 probably
> does this better, but I didn't test it.
>
> On with it:
>
>
> locale-dependent
>
> Unicode Classes (\p{L}, etc) are supported by everything but
> JavaScript. Perl and Java recognize some shorthand classes,
> but I believe there's a base set that's compatible, except
> for JavaScript.
>
> POSIX character classes ([:digit:], etc) are only supported
> by Perl and POSIX tools like grep. They're out.
>
> Escaping "^": In all dialects, you can escape "^" with "\".
> In everything but XSD, you can just put "^" somewhere other
> than the beginning of the class. This sounds insane to me.
>
> Escaping "]": In all dialects, you can escape "]" with "\".
> In everything but XSD and JavaScript, you can use it as the
> first character.
>
> \Q...\E expressions are only supported in Java and Perl.
>
> In all dialects, "-" can be escaped with a "\" or by putting
> it at the beginning or end of the character class.
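
The escaping rules for "^", "]", and "-" can be checked quickly in any one dialect. Below is a minimal sketch for Python 3's re module only (the message tested Python 2.7, and other dialects need their own checks). The helper name members is made up for this illustration.

```python
import re

# Backslash-escaped "^", "]", and "-" inside one character class.
escaped = re.compile(r'[\^\]\-a]')

# "-" placed first or last in the class, without a backslash.
hyphen_first = re.compile(r'[-ab]')
hyphen_last = re.compile(r'[ab-]')

def members(pattern, candidates):
    """Return the candidate characters that the class matches."""
    return [ch for ch in candidates if pattern.fullmatch(ch)]

print(members(escaped, '^]-ab'))      # ['^', ']', '-', 'a']
print(members(hyphen_first, '-abc'))  # ['-', 'a', 'b']
print(members(hyphen_last, '-abc'))   # ['-', 'a', 'b']
```

The backslash forms are the ones reported to work in all six dialects, so they are the safer choice when a class has to be shared between tools.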
>
> Character class subtractions (e.g. [a-z-[aeiou]]) are only
> supported in XSD. I read that .Net supports them, but that
> didn't pan out in my tests. It could be a newer addition to
> the standard, or it could be that Mono is buggy (rare).
>
> Octal escapes (e.g. \135 for "]") are supported in Python,
> Perl, and JavaScript. Hex escapes (e.g. \x5D for "]") are
> supported in all but XSD. Unicode escapes (e.g. \u2234 for
> "∴") are supported in all but XSD and Perl. This one makes
> me sad. I really wish we could use Unicode escapes safely.
>
> Everything supports "\n", "\r", and "\t".
>
> XSD and JavaScript don't support "\a" for U+0007 BELL. XSD,
> JavaScript, and Python don't support "\e" for U+001B ESCAPE.
> Not big losses. XSD is the only dialect that doesn't support
> "\f" for U+000C FORM FEED.
>
> Java, .Net, JavaScript, and Python support "\v" to match
> U+000B LINE TABULATION only. In Perl, "\v" matches anything
> it calls vertical whitespace, U+000A through U+000D. True
> to fashion, XSD doesn't support "\v" at all.
>
> Control code escapes: \cA through \cZ mean U+0001 through
> U+001A in Java, .Net, JavaScript, and Perl. In XSD, the \c
> escape means something entirely different. Python doesn't
> support \c.
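
The kind of probing described for numeric and control escapes can be sketched as a small helper that tries each escape inside a character class and checks whether it matches the intended character. This is only an illustration for Python 3's re module, whose behavior may differ from the Python 2.7 results reported in the message; the function name class_escape_works and the probe list are assumptions made for this example.

```python
import re

def class_escape_works(escape, expected_char):
    """True if [escape] compiles and matches expected_char."""
    try:
        pattern = re.compile('[' + escape + ']')
    except re.error:
        # The dialect rejects the escape outright.
        return False
    return pattern.fullmatch(expected_char) is not None

# (escape as written in the pattern, character it is meant to denote)
probes = [
    (r'\135', ']'),         # octal escape
    (r'\x5D', ']'),         # hex escape
    (r'\u2234', '\u2234'),  # Unicode escape for U+2234
    (r'\a', '\x07'),        # BELL
    (r'\f', '\x0c'),        # FORM FEED
    (r'\v', '\x0b'),        # LINE TABULATION
    (r'\e', '\x1b'),        # ESCAPE
    (r'\cA', '\x01'),       # control code escape
]

results = {escape: class_escape_works(escape, char) for escape, char in probes}
for escape, ok in results.items():
    print(escape, 'supported' if ok else 'not supported')
```

Comparing against the intended character, and not just checking that the pattern compiles, matters because a dialect may accept an escape but give it a different meaning.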
>
> Every dialect seems to support \d and \D and agree on what
> they actually mean.
>
> In XSD and .Net, \w matches lots of Unicode word characters.
> In the others, it matches [A-Za-z0-9_]. Although at least
> for Python, the documentation says it's locale-dependent.
> See my note on that below. They all support \W, with the
> same compatibility problem.
>
> Every dialect supports \s for whitespace, but they all have
> different definitions of whitespace. In XSD, \s matches
> space, tab, carriage return, line feed. In Java, Perl, and
> Python, \s matches those plus vertical tab and form feed.
> In JavaScript, \s matches all sorts of Unicode whitespace
> characters, like non-breaking spaces and zero-width spaces.
> They all support \S, with the same compatibility problem.
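
How much \w and \s can shift, even within a single dialect, can be seen in Python 3, where str patterns are Unicode-aware by default and the re.ASCII flag restores the narrow definitions. This is a sketch for Python 3 only and does not reproduce the Python 2.7 results above; the helper name matches is made up for this illustration.

```python
import re

def matches(pattern, text, flags=0):
    """True if the whole of text matches pattern under the given flags."""
    return re.fullmatch(pattern, text, flags) is not None

e_acute = '\u00e9'  # LATIN SMALL LETTER E WITH ACUTE
nbsp = '\u00a0'     # NO-BREAK SPACE

print(matches(r'[\w]', e_acute))            # True: Unicode word character
print(matches(r'[\w]', e_acute, re.ASCII))  # False: only [A-Za-z0-9_]
print(matches(r'[\s]', nbsp))               # True: Unicode whitespace
print(matches(r'[\s]', nbsp, re.ASCII))     # False: ASCII whitespace only
```

The same class gives different answers depending on a flag, which is one more reason to keep \w, \W, \s, and \S out of a portable subset.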
>
> In some dialects, some behavior changes based on locale.
> This is dangerous, and I believe we should avoid all such
> behavior. To the extent that it's useful, it should not
> be based on the locale the program is running in, but on
> the locale you're translating to or (possibly) from. And
> for the latter, we'd have to define an interaction with
> langPointer.
>
> So what I think this leaves us with is character classes
> [abc], ranges [a-c], and negations [^abc], where "^" and
> "]" must never appear unless backslash-escaped, "-" may
> be backslash-escaped or put at the beginning or end, the
> escape sequences "\n", "\r", "\t", "\d", and "\D" may be
> used, and literal "\" is escaped as "\\".
>
> Importantly, you must never have an unescaped backslash,
> because some dialects may treat it as the beginning of
> an escape sequence that means something special.
>
> This is a very limited subset, but I think it's what we
> have to use. I'm now going to try to make a portable RE
> that matches these portable RE character classes.
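
One possible shape for such a check is sketched below in Python 3. It is not the expression Shaun went on to write, only an illustration of the rules listed above. All names (CHAR_ESCAPE, is_portable_class, and so on) are hypothetical. Two points are assumptions beyond the message: range endpoints must be single characters (so \d and \D cannot start or end a range), and an unescaped "[" inside the class is treated as an ordinary literal because the message does not address it.

```python
import re

# Escapes that stand for one character: \n \r \t \\ \^ \] \-
CHAR_ESCAPE = r'\\[nrt\\\^\]\-]'
# Escapes that stand for a set of characters: \d \D
CLASS_ESCAPE = r'\\[dD]'
# Any single character other than \ ^ ] -
LITERAL = r'[^\\\^\]\-]'

# A range endpoint must denote exactly one character (assumption).
SINGLE = '(?:' + CHAR_ESCAPE + '|' + LITERAL + ')'
ATOM = '(?:' + CHAR_ESCAPE + '|' + CLASS_ESCAPE + '|' + LITERAL + ')'
# An item is either a range like a-c or a single atom.
ITEM = '(?:' + SINGLE + '-' + SINGLE + '|' + ATOM + ')'

# "[", optional "^", then items with an optional bare "-" first or last
# (or a lone "-"), then "]".
PORTABLE_CLASS = re.compile(r'\[\^?(?:-?' + ITEM + r'+-?|-)\]')

def is_portable_class(expr):
    """True if expr is a character class within the subset sketched here."""
    return PORTABLE_CLASS.fullmatch(expr) is not None

print(is_portable_class('[a-c]'))     # True
print(is_portable_class('[^abc]'))    # True
print(is_portable_class(r'[a\]b]'))   # True
print(is_portable_class('[a^b]'))     # False: unescaped "^" after the start
print(is_portable_class(r'[\w]'))     # False: escape outside the subset
```

The character classes used inside the checker itself stick to backslash escapes for "\", "^", "]", and "-", so they stay within the same subset they validate.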
>
> Comments?
>
> --
> Shaun
>
Received on Monday, 28 January 2013 07:32:33 UTC