- From: Yves Savourel <ysavourel@enlaso.com>
- Date: Sun, 3 Feb 2013 14:32:35 -0700
- To: <public-multilingualweb-lt-comments@w3.org>
Hi Dave,

Yes, the sub-set described by Shaun in http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2013Jan/0180.html would be fine.
A note on best practices for Unicode normalization would be fine too.

Thanks,
-yves

-----Original Message-----
From: Dave Lewis [mailto:dave.lewis@cs.tcd.ie]
Sent: Sunday, February 03, 2013 2:01 PM
To: public-multilingualweb-lt-comments@w3.org
Subject: Re: [ACTION-385] Common regular expression syntax

Hi Yves,

In relation to ISSUE-67, as the original commenter, can you confirm you are satisfied with the resolution that we use the limited regex subset Shaun identified in response to ACTION-385 in the Allowed Characters data category, together with an accompanying note on best practice for Unicode normalisation that Shaun is addressing under ACTION-430 (Draft text explaining importance of Unicode normalization and best practices on ISSUE-67)?

Regards,
Dave

On 27/01/2013 22:31, Yves Savourel wrote:
> Hi Shaun,
>
> Thanks for the thorough analysis.
> That should be enough for the goals of the data category.
>
> cheers,
> -yves
>
>
> -----Original Message-----
> From: Shaun McCance [mailto:shaunm@gnome.org]
> Sent: Sunday, January 27, 2013 10:30 AM
> To: public-multilingualweb-lt@w3.org
> Subject: [ACTION-385] Common regular expression syntax
>
> I've investigated features in six different regular expression dialects to try to find a safe common subset for the allowed characters data category. I tested Java, .Net, XSD, JavaScript, Perl, and Python. I still want to test POSIX EREs, and PHP may be good to test as well, given the focus on CMSs in 2.0. But I think the subset from the six I tested is going to be safe in general.
>
> [...]
>
> So what I think this leaves us with is character classes [abc], ranges [a-c], and negations [^abc], where "^" and "]" must never appear unless backslash-escaped, "-" may be backslash-escaped or put at the beginning or end, the escape sequences "\n", "\r", "\t", "\d", and "\D" may be used, and literal "\" is escaped as "\\".
>
> Importantly, you must never have an unescaped backslash, because some dialects may treat it as the beginning of an escape sequence that means something special.
>
> This is a very limited subset, but I think it's what we have to use. I'm now going to try to make a portable RE that matches these portable RE character classes.
>
> Comments?
>
> --
> Shaun
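
For illustration only, the subset described in the quoted message could be checked with a small hand-written scanner along the following lines. This is a sketch, not the portable RE Shaun mentions and not anything taken from the ITS 2.0 text. The names `is_portable_char_class`, `_tokenize`, and `_code_point` are hypothetical. Several choices are assumptions that go beyond what the message states: an unescaped "-" in the middle is read as a range operator, "\d" and "\D" are rejected as range endpoints, reversed ranges such as [c-a] are rejected, empty classes are rejected, and an unescaped "[" inside the class is accepted because the quoted rules do not mention it.

```python
# Escapes named in the quoted message: \n \r \t \d \D, plus the
# backslash-escaped literals \\ \^ \] \-
ALLOWED_ESCAPES = set("nrtdD\\^]-")
CLASS_ESCAPES = {"\\d", "\\D"}
CONTROL_CHARS = {"n": "\n", "r": "\r", "t": "\t"}


def _tokenize(body):
    """Split a class body into tokens; return None on an illegal character."""
    tokens = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            # A backslash must start one of the allowed two-character escapes.
            if i + 1 >= len(body) or body[i + 1] not in ALLOWED_ESCAPES:
                return None
            tokens.append(body[i:i + 2])
            i += 2
        elif ch in "^]":
            # "^" and "]" are only legal when backslash-escaped.
            return None
        else:
            tokens.append(ch)
            i += 1
    return tokens


def _code_point(token):
    """Return the code point that a single-character token stands for."""
    if len(token) == 1:
        return ord(token)
    return ord(CONTROL_CHARS.get(token[1], token[1]))


def is_portable_char_class(expr):
    """Check expr against the limited character-class subset (assumptions above)."""
    if len(expr) < 3 or expr[0] != "[" or expr[-1] != "]":
        return False
    body = expr[1:-1]
    if body.startswith("^"):
        body = body[1:]  # leading "^" marks a negated class
    tokens = _tokenize(body)
    if not tokens:
        return False  # illegal character, or an empty class
    last = len(tokens) - 1
    i = 0
    while i <= last:
        if tokens[i] == "-":
            # An unescaped hyphen is a literal only at the very start or end.
            if i not in (0, last):
                return False
            i += 1
        elif i + 2 <= last and tokens[i + 1] == "-":
            # A range such as a-c: both endpoints must be single characters.
            low, high = tokens[i], tokens[i + 2]
            if high == "-" or low in CLASS_ESCAPES or high in CLASS_ESCAPES:
                return False
            if _code_point(low) > _code_point(high):
                return False
            i += 3
        else:
            i += 1
    return True
```

Under these assumptions, `is_portable_char_class("[a-c]")` and `is_portable_char_class(r"[\d\n-]")` are accepted, while `is_portable_char_class(r"[a\w]")` is rejected because "\w" is outside the listed escape sequences.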
Received on Sunday, 3 February 2013 21:33:08 UTC