- From: Jörg Schütz <joerg@bioloom.de>
- Date: Mon, 28 Jan 2013 08:32:11 +0100
- To: public-multilingualweb-lt@w3.org
Hi Shaun,

Very good analysis! The identified subset should be fairly sufficient
for the ITS 2.0 regex needs although it is quite limited...

Cheers
-- Jörg

On Jan 27, 2013, at 18:30 (CET), Shaun McCance wrote:
> I've investigated features in six different regular expression
> dialects to try to find a safe common subset for the allowed
> characters data category. I tested Java, .Net, XSD, JavaScript,
> Perl, and Python. I still want to test POSIX EREs, and PHP may
> be good to test as well, given the focus on CMSs in 2.0. But I
> think the subset from the six I tested is going to be safe in
> general.
>
> I only tested RE features found inside character classes, i.e.
> stuff between '[' and ']'. Everything was tested on Fedora 14.
> Java with OpenJDK 1.6.0. .Net with Mono 2.6.7. JavaScript with
> Firefox 3.6.12. XSD with libxml2 2.7.7. Perl 5.12.2. Python 2.7.
>
> Notes: For Perl I had to use "use utf8;" to get anything beyond
> ASCII to work right. For Python, I had to use Unicode objects
> with the u'' notation to get non-ASCII right. Python 3 probably
> does this better, but I didn't test it.
>
> On with it:
>
>
> locale-dependant
>
> Unicode Classes (\p{L}, etc) are supported by everything but
> JavaScript. Perl and Java recognize some shorthand classes,
> but I believe there's a base set that's compatible, except
> for JavaScript.
>
> POSIX character classes ([:digit:], etc) are only supported
> by Perl and POSIX tools like grep. They're out.
>
> Escaping "^": In all dialects, you can escape "^" with "\".
> In everything but XSD, you can just put "^" somewhere other
> than the beginning of the class. This sounds insane to me.
>
> Escaping "]": In all dialects, you can escape "]" with "\".
> In everything but XSD and JavaScript, you can use it as the
> first character.
>
> \Q...\E expressions are only supported in Java and Perl.
>
> In all dialects, "-" can be escaped with a "\" or by putting
> it at the beginning or end of the character class.
>
> Character class subtractions (e.g. [a-z-[aeiou]]) are only
> supported in XSD. I read that .Net supports them, but that
> didn't pan out in my tests. It could be a newer addition to
> the standard, or it could be that Mono is buggy (rare).
>
> Octal escapes (e.g. \135 for "]") are supported in Python,
> Perl, and JavaScript. Hex escapes (e.g. \x5D for "]") are
> supported in all but XSD. Unicode escapes (e.g. \u2234 for
> "∴") are supported in all but XSD and Perl. This one makes
> me sad. I really wish we could use Unicode escapes safely.
>
> Everything supports "\n", "\r", and "\t".
>
> XSD and JavaScript don't support "\a" for U+0007 BELL. XSD,
> JavaScript, and Python don't support "\e" for U+001B ESCAPE.
> Not big losses. XSD is the only dialect that doesn't support
> "\f" for U+000C FORM FEED.
>
> Java, .Net, JavaScript, and Python support "\v" to match
> U+000B LINE TABULATION only. In Perl, "\v" matches anything
> it calls vertical whitespace, U+000A through U+000D. True
> to fashion, XSD doesn't support "\v" at all.
>
> Control code escapes: \cA through \cZ mean U+0001 through
> U+001A in Java, .Net, JavaScript, and Perl. In XSD, the \c
> escape means something entirely different. Python doesn't
> support \c.
>
> Every dialect seems to support \d and \D and agree on what
> they actually mean.
>
> In XSD and .Net, \w matches lots of Unicode word characters.
> In the others, it matches [A-Za-z0-9_]. Although at least
> for Python, the documentation says it's locale-dependent.
> See my note on that below. They all support \W, with the
> same compatibility problem.
>
> Every dialect supports \s for whitespace, but they all have
> different definitions of whitespace. In XSD, \s matches
> space, tab, carriage return, line feed. In Java, Perl, and
> Python, \s matches those plus vertical tab and form feed.
> In JavaScript, \s matches all sorts of Unicode whitespace
> characters, like non-breaking spaces and zero-width spaces.
> They all support \S, with the same compatibility problem.
>
> In some dialects, some behavior changes based on locale.
> This is dangerous, and I believe we should avoid all such
> behavior. To the extent that it's useful, it should not
> be based on the locale the program is running in, but on
> the locale you're translating to or (possibly) from. And
> for the latter, we'd have to define an interaction with
> langPointer.
>
> So what I think this leaves us with is character classes
> [abc], ranges [a-c], and negations [^abc], where "^" and
> "]" must never appear unless backslash-escaped, "-" may
> be backslash-escaped or put at the beginning or end, the
> escape sequences "\n", "\r", "\t", "\d", and "\D" may be
> used, and literal "\" is escaped as "\\".
>
> Importantly, you must never have an unescaped backslash,
> because some dialects may treat it as the beginning of
> an escape sequence that means something special.
>
> This is a very limited subset, but I think it's what we
> have to use. I'm now going to try to make a portable RE
> that matches these portable RE character classes.
>
> Comments?
>
> --
> Shaun
>
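
As an illustration of the subset proposed in the quoted message, here is
a minimal Python sketch of a checker for a single portable character
class. It is not the portable RE Shaun mentions, which is not shown in
this thread. The names SINGLE, ITEM, PORTABLE_CLASS and
is_portable_class are hypothetical, and the sketch makes some
assumptions the message does not state: a class must contain at least
one character, range endpoints must be single characters (so not "\d"
or "\D"), and "[" is treated as an ordinary literal because the message
does not discuss it.

```python
import re

# One character: an allowed backslash escape (\n \r \t \\ \^ \] \-),
# or any literal other than the four characters the subset treats as
# special: backslash, "^", "]" and "-".
SINGLE = r'(?:\\[nrt\\\^\]\-]|[^\\\^\]\-])'

# One item: a range such as a-c, the class escapes \d or \D, or a
# single character. Assumption: \d and \D cannot be range endpoints.
ITEM = r'(?:{s}-{s}|\\[dD]|{s})'.format(s=SINGLE)

# A whole class: "[", optional "^" negation, an optional unescaped "-"
# at the beginning and/or end, and at least one character overall.
PORTABLE_CLASS = re.compile(r'\[\^?(?:-?{i}+-?|-)\]'.format(i=ITEM))


def is_portable_class(text):
    """Return True if text is one character class within the subset."""
    return PORTABLE_CLASS.fullmatch(text) is not None
```

Under these assumptions, is_portable_class('[a-c]') and
is_portable_class('[^abc]') return True, while
is_portable_class(r'[\w]') and is_portable_class('[a^b]') return False.
The pattern itself is written using only constructs from the subset
inside its own character classes.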
Received on Monday, 28 January 2013 07:32:33 UTC