Re: [ACTION-385] Common regular expression syntax

Hi Yves,
In relation to ISSUE-67, as the original commenter, can you confirm you 
are satisfied with the resolution that we use the limited regex subset 
Shaun identified in response to ACTION-385 in the Allowed Characters 
data category, together with an accompanying note on best practice for 
Unicode normalisation that Shaun is addressing under ACTION-430 (Draft 
text explaining importance of Unicode normalization and best practices 
on ISSUE-67)?

Regards,
Dave



On 27/01/2013 22:31, Yves Savourel wrote:
> Hi Shaun,
>
> Thanks for the thorough analysis.
> That should be enough for the goals of the data category.
>
> cheers,
> -yves
>
>
> -----Original Message-----
> From: Shaun McCance [mailto:shaunm@gnome.org]
> Sent: Sunday, January 27, 2013 10:30 AM
> To: public-multilingualweb-lt@w3.org
> Subject: [ACTION-385] Common regular expression syntax
>
> I've investigated features in six different regular expression dialects to try to find a safe common subset for the allowed characters data category. I tested Java, .Net, XSD, JavaScript, Perl, and Python. I still want to test POSIX EREs, and PHP may be good to test as well, given the focus on CMSs in 2.0. But I think the subset from the six I tested is going to be safe in general.
[...]
> So what I think this leaves us with is character classes [abc], ranges [a-c], and negations [^abc], where "^" and "]" must never appear unless backslash-escaped, "-" may be backslash-escaped or put at the beginning or end, the escape sequences "\n", "\r", "\t", "\d", and "\D" may be used, and literal "\" is escaped as "\\".
>
> Importantly, you must never have an unescaped backslash, because some dialects may treat it as the beginning of an escape sequence that means something special.
>
> This is a very limited subset, but I think it's what we have to use. I'm now going to try to make a portable RE that matches these portable RE character classes.
>
> Comments?
>
> --
> Shaun
>
>
>
>
>
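
For illustration only, the subset summarised in the quoted message could be
checked with a small hand-written validator along the following lines. This
is a sketch, not the portable RE Shaun mentions and not text from any draft.
The function name is_portable_class is hypothetical, and several choices are
assumptions that the summary does not settle: an unescaped "[" inside the
class is rejected, an empty class is rejected, "\d" and "\D" are not allowed
as range endpoints, chained ranges such as a-c-e are rejected, and a range
must run from a lower to a higher code point.

```python
# Escapes named in the quoted summary: \n \r \t, plus backslash-escaped
# "\", "^", "]" and "-". The shorthand classes \d and \D are kept separate
# because (assumption) they should not be used as range endpoints.
CHAR_ESCAPES = {"n": "\n", "r": "\r", "t": "\t",
                "\\": "\\", "^": "^", "]": "]", "-": "-"}
CLASS_ESCAPES = {"d", "D"}


def is_portable_class(expr):
    """Return True if expr is one character class in the limited subset."""
    if len(expr) < 3 or not (expr.startswith("[") and expr.endswith("]")):
        return False
    body = expr[1:-1]
    if body.startswith("^"):          # leading "^" is the negation marker
        body = body[1:]

    # Pass 1: split the body into items, rejecting anything outside the subset.
    items = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            nxt = body[i + 1:i + 2]   # empty string if the backslash is last
            if nxt in CLASS_ESCAPES:
                items.append(("class", nxt))
            elif nxt in CHAR_ESCAPES:
                items.append(("char", CHAR_ESCAPES[nxt]))
            else:
                return False          # unknown or dangling escape
            i += 2
        elif ch in "^][":
            return False              # must be escaped ("[" is an assumption)
        elif ch == "-":
            items.append(("hyphen", ch))
            i += 1
        else:
            items.append(("char", ch))
            i += 1
    if not items:
        return False                  # assumption: empty classes are rejected

    # Pass 2: an unescaped "-" is either a range operator between two
    # characters or a literal at the very beginning or end.
    n = len(items)
    j = 0
    while j < n:
        kind, value = items[j]
        if kind == "hyphen":
            if j not in (0, n - 1):
                return False
            j += 1
        elif j + 2 < n and items[j + 1][0] == "hyphen":
            end_kind, end_value = items[j + 2]
            if kind != "char" or end_kind != "char" or value > end_value:
                return False
            j += 3
        else:
            j += 1
    return True
```

With these assumptions, "[abc]", "[a-c]", "[^abc]" and "[-a-c]" are accepted,
while "[\w]", "[a^b]" and "[a-c-e]" are rejected.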

Received on Sunday, 3 February 2013 21:02:02 UTC