RE: [ACTION-385] Common regular expression syntax

From: Yves Savourel <ysavourel@enlaso.com>
Date: Sun, 3 Feb 2013 14:32:35 -0700
To: <public-multilingualweb-lt-comments@w3.org>
Message-ID: <assp.0746f270eb.assp.0746b95acb.013601ce0255$fbf4c040$f3de40c0$@com>
Hi Dave,

Yes, the subset described by Shaun in http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2013Jan/0180.html would be fine.
A note on best practices for Unicode normalization would be fine too.


-----Original Message-----
From: Dave Lewis [mailto:dave.lewis@cs.tcd.ie] 
Sent: Sunday, February 03, 2013 2:01 PM
To: public-multilingualweb-lt-comments@w3.org
Subject: Re: [ACTION-385] Common regular expression syntax

Hi Yves,
In relation to ISSUE-67, as the original commenter, can you confirm that you are satisfied with the resolution? The resolution is that we use the limited regex subset Shaun identified in response to ACTION-385 in the Allowed Characters data category, together with an accompanying note on best practice for Unicode normalisation, which Shaun is addressing under ACTION-430 (Draft text explaining importance of Unicode normalization and best practices on ISSUE-67).


On 27/01/2013 22:31, Yves Savourel wrote:
> Hi Shaun,
> Thanks for the thorough analysis.
> That should be enough for the goals of the data category.
> cheers,
> -yves
> -----Original Message-----
> From: Shaun McCance [mailto:shaunm@gnome.org]
> Sent: Sunday, January 27, 2013 10:30 AM
> To: public-multilingualweb-lt@w3.org
> Subject: [ACTION-385] Common regular expression syntax
> I've investigated features in six different regular expression dialects to try to find a safe common subset for the allowed characters data category. I tested Java, .NET, XSD, JavaScript, Perl, and Python. I still want to test POSIX EREs, and PHP may be good to test as well, given the focus on CMSs in 2.0. But I think the subset from the six I tested is going to be safe in general.
> So what I think this leaves us with is character classes [abc], ranges [a-c], and negations [^abc], where "^" and "]" must never appear unless backslash-escaped, "-" may be backslash-escaped or put at the beginning or end, the escape sequences "\n", "\r", "\t", "\d", and "\D" may be used, and literal "\" is escaped as "\\".
> Importantly, you must never have an unescaped backslash, because some dialects may treat it as the beginning of an escape sequence that means something special.
> This is a very limited subset, but I think it's what we have to use. I'm now going to try to make a portable RE that matches these portable RE character classes.
> Comments?
> --
> Shaun
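
For illustration only, the subset Shaun describes above can be sketched as a small Python check. This is one possible reading of the rules in his message, not the pattern that Shaun or the working group produced. The names PORTABLE_CLASS and is_portable_class are hypothetical, and two choices are assumptions of this sketch: empty classes are rejected, and "\d" and "\D" are not accepted as range endpoints.

```python
import re

# Assumption: a direct encoding of the rules quoted above, nothing more.
# Allowed escapes: \n \r \t \d \D, plus escaped \ ] ^ -
_ESCAPE = r"\\[nrtdD\\\]\^\-]"
# Any single character except the four that must be escaped: \ ] ^ -
_LITERAL = r"[^\\\]\^\-]"
_ATOM = rf"(?:{_ESCAPE}|{_LITERAL})"
# Range endpoints: as above, but without \d and \D (assumption of this sketch)
_ENDPOINT = rf"(?:\\[nrt\\\]\^\-]|{_LITERAL})"
_ITEM = rf"(?:{_ENDPOINT}-{_ENDPOINT}|{_ATOM})"

# "[", optional negating "^", optional leading "-", items, optional trailing "-", "]".
# The lookahead rejects the empty classes "[]" and "[^]" (assumption of this sketch).
PORTABLE_CLASS = re.compile(rf"\[(?!\^?\])\^?-?{_ITEM}*-?\]")


def is_portable_class(expr: str) -> bool:
    """Return True if expr is a single character class within the subset."""
    return PORTABLE_CLASS.fullmatch(expr) is not None


if __name__ == "__main__":
    for candidate in ["[abc]", "[a-c]", "[^abc]", r"[a\wb]", "[a^b]"]:
        print(candidate, is_portable_class(candidate))
```

The check is deliberately conservative: anything it cannot place in the subset, such as "\w" or an unescaped "^" in the middle of a class, is reported as not portable.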
Received on Sunday, 3 February 2013 21:33:08 UTC
