
Re: [ACTION-385] Common regular expression syntax

From: Dave Lewis <dave.lewis@cs.tcd.ie>
Date: Sun, 03 Feb 2013 21:01:18 +0000
Message-ID: <510ED01E.5070909@cs.tcd.ie>
To: public-multilingualweb-lt-comments@w3.org

Hi Yves,
In relation to ISSUE-67, as the original commenter, can you confirm that
you are satisfied with the resolution: that in the Allowed Characters
data category we use the limited regex subset Shaun identified in
response to ACTION-385, together with an accompanying note on best
practice for Unicode normalisation that Shaun is addressing under
ACTION-430 (Draft text explaining importance of Unicode normalization
and best practices on ISSUE-67)?


On 27/01/2013 22:31, Yves Savourel wrote:
> Hi Shaun,
> Thanks for the thorough analysis.
> That should be enough for the goals of the data category.
> Cheers,
> -yves
> -----Original Message-----
> From: Shaun McCance [mailto:shaunm@gnome.org]
> Sent: Sunday, January 27, 2013 10:30 AM
> To: public-multilingualweb-lt@w3.org
> Subject: [ACTION-385] Common regular expression syntax
> I've investigated features in six different regular expression dialects to try to find a safe common subset for the allowed characters data category. I tested Java, .NET, XSD, JavaScript, Perl, and Python. I still want to test POSIX EREs, and PHP may be good to test as well, given the focus on CMSs in 2.0. But I think the subset from the six I tested is going to be safe in general.
> So what I think this leaves us with is character classes [abc], ranges [a-c], and negations [^abc], where "^" and "]" must never appear unless backslash-escaped, and "-" may be backslash-escaped or put at the beginning or end. The escape sequences "\n", "\r", "\t", "\d", and "\D" may be used, and a literal "\" is escaped as "\\".
> Importantly, you must never have an unescaped backslash, because some dialects may treat it as the beginning of an escape sequence that means something special.
> This is a very limited subset, but I think it's what we have to use. I'm now going to try to make a portable RE that matches these portable RE character classes.
> Comments?
> --
> Shaun
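
The rules in Shaun's message can be checked mechanically. What follows is a minimal sketch in Python, not the portable RE Shaun mentions and not anything from the ITS drafts. It assumes the whole value is a single bracket expression, and the names is_portable_char_class, ESCAPED_LITERALS and CLASS_ESCAPES are made up for illustration. It also makes two choices the message does not state: "\d" and "\D" may not bound a range, and a range must run from a lower to a higher code point. Anything the message does not mention, such as an unescaped "[" inside the class, is not checked.

```python
# Hypothetical helper: a small scanner for the rules listed in the quoted
# message. It is an illustration, not Shaun's portable RE.

# Escapes that stand for exactly one literal character.
ESCAPED_LITERALS = {
    "\\n": "\n", "\\r": "\r", "\\t": "\t",
    "\\\\": "\\", "\\^": "^", "\\]": "]", "\\-": "-",
}
# Escapes that stand for a set of characters (assumed unusable in ranges).
CLASS_ESCAPES = {"\\d", "\\D"}


def _code_point(token: str) -> int:
    """Return the code point a single-character token stands for."""
    return ord(ESCAPED_LITERALS.get(token, token))


def is_portable_char_class(expr: str) -> bool:
    """Return True if expr is one bracket expression within the subset."""
    if len(expr) < 3 or expr[0] != "[" or expr[-1] != "]":
        return False
    body = expr[1:-1]
    if body.startswith("^"):  # leading "^" marks a negated class
        body = body[1:]
    if not body:
        return False

    # Pass 1: split the body into tokens, rejecting stray "\", "^" and "]".
    tokens = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            pair = body[i:i + 2]
            if pair not in ESCAPED_LITERALS and pair not in CLASS_ESCAPES:
                return False  # unknown escape, or "\" at the very end
            tokens.append(pair)
            i += 2
        elif ch in "^]":
            return False  # must be backslash-escaped
        else:
            tokens.append(ch)
            i += 1

    # Pass 2: an unescaped "-" is either first, last, or part of a range.
    k, n = 0, len(tokens)
    while k < n:
        tok = tokens[k]
        if tok == "-":
            if k != 0 and k != n - 1:
                return False
            k += 1
        elif k + 2 < n and tokens[k + 1] == "-":
            lo, hi = tok, tokens[k + 2]
            if lo in CLASS_ESCAPES or hi in CLASS_ESCAPES or hi == "-":
                return False
            if _code_point(lo) > _code_point(hi):
                return False  # reversed range such as [c-a]
            k += 3
        else:
            k += 1
    return True
```

With this sketch, values such as [abc], [a-c], [^abc] and [\d\-a-z-] are accepted, while [\w], [a^b] and [c-a] are rejected. A scanner was chosen over a single regular expression only because the hyphen rules are easier to read as explicit steps.
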
Received on Sunday, 3 February 2013 21:02:02 UTC
