- From: Arle Lommel <arle.lommel@dfki.de>
- Date: Sun, 27 Jan 2013 18:39:02 +0100
- To: Shaun McCance <shaunm@gnome.org>
- Cc: "public-multilingualweb-lt@w3.org" <public-multilingualweb-lt@w3.org>
- Message-Id: <A5DFDC4B-1D04-4AEB-B051-488892006488@dfki.de>
Sounds great, Shaun. I would have anticipated a larger compatibility range than that, but it is what it is and I believe your subset should serve for the intended purpose quite well. For more complex pattern matching requirements we can say that they are outside our scope and have to be handled with other, negotiated mechanisms.

Arle

--
Arle Lommel
Berlin, Germany
Skype: arle_lommel
Phone (US): +1 707 709 8650

Sent from a mobile device. Please excuse any typos.

On Jan 27, 2013, at 18:30, Shaun McCance <shaunm@gnome.org> wrote:

> I've investigated features in six different regular expression
> dialects to try to find a safe common subset for the allowed
> characters data category. I tested Java, .Net, XSD, JavaScript,
> Perl, and Python. I still want to test POSIX EREs, and PHP may
> be good to test as well, given the focus on CMSs in 2.0. But I
> think the subset from the six I tested is going to be safe in
> general.
>
> I only tested RE features found inside character classes, i.e.
> stuff between '[' and ']'. Everything was tested on Fedora 14.
> Java with OpenJDK 1.6.0. .Net with Mono 2.6.7. JavaScript with
> Firefox 3.6.12. XSD with libxml2 2.7.7. Perl 5.12.2. Python 2.7.
>
> Notes: For Perl I had to use "use utf8;" to get anything beyond
> ASCII to work right. For Python, I had to use Unicode objects
> with the u'' notation to get non-ASCII right. Python 3 probably
> does this better, but I didn't test it.
>
> On with it:
>
> Unicode Classes (\p{L}, etc) are supported by everything but
> JavaScript. Perl and Java recognize some shorthand classes,
> but I believe there's a base set that's compatible, except
> for JavaScript.
>
> POSIX character classes ([:digit:], etc) are only supported
> by Perl and POSIX tools like grep. They're out.
>
> Escaping "^": In all dialects, you can escape "^" with "\".
> In everything but XSD, you can just put "^" somewhere other
> than the beginning of the class. This sounds insane to me.
>
> Escaping "]": In all dialects, you can escape "]" with "\".
> In everything but XSD and JavaScript, you can use it as the
> first character.
>
> \Q...\E expressions are only supported in Java and Perl.
>
> In all dialects, "-" can be escaped with a "\" or by putting
> it at the beginning or end of the character class.
>
> Character class subtractions (e.g. [a-z-[aeiou]]) are only
> supported in XSD. I read that .Net supports them, but that
> didn't pan out in my tests. It could be a newer addition to
> the standard, or it could be that Mono is buggy (rare).
>
> Octal escapes (e.g. \135 for "]") are supported in Python,
> Perl, and JavaScript. Hex escapes (e.g. \x5D for "]") are
> supported in all but XSD. Unicode escapes (e.g. \u2234 for
> "∴") are supported in all but XSD and Perl. This one makes
> me sad. I really wish we could use Unicode escapes safely.
>
> Everything supports "\n", "\r", and "\t".
>
> XSD and JavaScript don't support "\a" for U+0007 BELL. XSD,
> JavaScript, and Python don't support "\e" for U+001B ESCAPE.
> Not big losses. XSD is the only dialect that doesn't support
> "\f" for U+000C FORM FEED.
>
> Java, .Net, JavaScript, and Python support "\v" to match
> U+000B LINE TABULATION only. In Perl, "\v" matches anything
> it calls vertical whitespace, U+000A through U+000D. True
> to fashion, XSD doesn't support "\v" at all.
>
> Control code escapes: \cA through \cZ means U+0001 through
> U+001A in Java, .Net, JavaScript, and Perl. In XSD, the \c
> escape means something entirely different. Python doesn't
> support \c.
>
> Every dialect seems to support \d and \D and agree on what
> they actually mean.
>
> In XSD and .Net, \w matches lots of Unicode word characters.
> In the others, it matches [A-Za-z0-9_]. Although at least
> for Python, the documentation says it's locale-dependent.
> See my note on that below. They all support \W, with the
> same compatibility problem.
>
> Every dialect supports \s for whitespace, but they all have
> different definitions of whitespace. In XSD, \s matches
> space, tab, carriage return, line feed. In Java, Perl, and
> Python, \s matches those plus vertical tab and form feed.
> In JavaScript, \s matches all sorts of Unicode whitespace
> characters, like non-breaking spaces and zero-width spaces.
> They all support \S, with the same compatibility problem.
>
> In some dialects, some behavior changes based on locale.
> This is dangerous, and I believe we should avoid all such
> behavior. To the extent that it's useful, it should not
> be based on the locale the program is running in, but on
> the locale you're translating to or (possibly) from. And
> for the latter, we'd have to define an interaction with
> langPointer.
>
> So what I think this leaves us with is character classes
> [abc], ranges [a-c], and negations [^abc], where "^" and
> "]" must never appear unless backslash-escaped, "-" may
> be backslash-escaped or put at the beginning or end, the
> escape sequences "\n", "\r", "\t", "\d", and "\D" may be
> used, and literal "\" is escaped as "\\".
>
> Importantly, you must never have an unescaped backslash,
> because some dialects may treat it as the beginning of
> an escape sequence that means something special.
>
> This is a very limited subset, but I think it's what we
> have to use. I'm now going to try to make a portable RE
> that matches these portable RE character classes.
>
> Comments?
>
> --
> Shaun
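
For illustration only, the subset proposed in the quoted message could be checked with a small hand-written validator along the following lines. This is a rough sketch in Python, not the portable RE that Shaun says he intends to write. The function name is_portable_class is hypothetical, and two choices are assumptions that the message does not settle: an unescaped "[" inside the class is accepted as a literal, and a "-" is accepted only at the beginning, at the end, or between two single characters (so \d and \D cannot be range endpoints). Range endpoint order is not checked.

```python
# Escapes listed as portable in the message: \n \r \t plus the
# backslash-escaped literals \\ \^ \] \-
SINGLE_CHAR_ESCAPES = set("nrt\\^]-")
# \d and \D stand for sets of characters, so this sketch (an assumption)
# does not allow them as range endpoints.
CLASS_ESCAPES = set("dD")


def is_portable_class(expr):
    """Return True if expr looks like a character class in the proposed subset."""
    if len(expr) < 3 or expr[0] != "[" or expr[-1] != "]":
        return False
    body = expr[1:-1]
    if body.startswith("^"):  # a leading ^ negates the class
        body = body[1:]
    if not body:
        return False

    # Pass 1: split the body into tokens, rejecting anything non-portable.
    tokens = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            if i + 1 >= len(body):
                return False  # trailing backslash would escape the closing ]
            nxt = body[i + 1]
            if nxt in SINGLE_CHAR_ESCAPES:
                tokens.append("char")
            elif nxt in CLASS_ESCAPES:
                tokens.append("class")
            else:
                return False  # e.g. \w, \s, \x5D, \u2234, \cA
            i += 2
        elif ch in "^]":
            return False  # must be backslash-escaped
        elif ch == "-":
            tokens.append("dash")
            i += 1
        else:
            tokens.append("char")
            i += 1

    # Pass 2: a dash must be first, last, or sit between two single chars.
    j, n = 0, len(tokens)
    while j < n:
        if tokens[j] == "dash":
            if j != 0 and j != n - 1:
                return False
            j += 1
        elif tokens[j] == "char" and j + 2 < n and tokens[j + 1] == "dash":
            if tokens[j + 2] != "char":
                return False
            j += 3  # consumed a full range such as a-c
        else:
            j += 1
    return True
```

Under these assumptions, is_portable_class("[^a-c]") returns True, while is_portable_class(r"[\w]") and is_portable_class("[a^b]") return False.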
Received on Sunday, 27 January 2013 17:49:16 UTC