Re: [ACTION-385] Common regular expression syntax

From: Felix Sasaki <fsasaki@w3.org>
Date: Sun, 03 Feb 2013 22:40:51 +0100
Message-ID: <510ED963.1020602@w3.org>
To: "public-multilingualweb-lt-comments@w3.org" <public-multilingualweb-lt-comments@w3.org>
On 03.02.13 22:32, Yves Savourel wrote:
> Hi Dave,
>
> Yes, the sub-set described by Shaun in http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2013Jan/0180.html would be fine.
> A note on best practices for Unicode normalization would be fine too.

Hi Dave, Yves, all,

thanks for following up on this. Just to make one point clear: this 
would not resolve the issue.

ACTION-385 is not closed; see
http://www.w3.org/2013/01/16-mlw-lt-minutes.html#action02
[ *ACTION:* shaun to work on regex for validating regex subset proposal]

See also Shaun at the bottom of
http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2013Jan/0180.html

[

I'm now going to try to make a portable RE
that matches these portable RE character classes.

]

We still need a regular expression for validating the regex, via the 
data type definition in the schema. That will also help to avoid 
creating positive and negative test cases for the regex.
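
To illustrate what such a validating expression might look like, here is a
minimal sketch in Python. It checks a single bracketed character class
against the subset Shaun summarises in the quoted message below. It is not
the pattern being worked on under ACTION-385, and several points are
assumptions of this sketch rather than anything stated in the thread: the
names PORTABLE_CLASS and is_portable_class are hypothetical, the expression
is assumed to be exactly one character class, an unescaped "[" inside the
class is rejected, an empty class is rejected, and range order (a-c versus
c-a) is not checked. It also uses a lookahead, which XSD patterns do not
support, so it would need reworking before it could serve as a schema data
type.

```python
import re

# One ordinary character: anything except \ [ ] ^ -
# (rejecting "[" is an assumption of this sketch).
LITERAL = r"[^\\\[\]^\-]"
# Escapes that stand for a single character: \n \r \t \\ \^ \] \-
SINGLE_ESC = r"\\[nrt\\^\]\-]"
# Escapes that stand for a set of characters: \d \D
CLASS_ESC = r"\\[dD]"

CHAR = rf"(?:{LITERAL}|{SINGLE_ESC})"
# A range, a single character, or \d / \D.
ITEM = rf"(?:{CHAR}-{CHAR}|{CHAR}|{CLASS_ESC})"

# "[", optional negation, non-empty body, "]".
# An unescaped "-" is accepted only at the very start or very end.
PORTABLE_CLASS = re.compile(rf"\[\^?(?!\])-?{ITEM}*-?\]")


def is_portable_class(expr: str) -> bool:
    """Return True if expr is one character class in the restricted subset."""
    return PORTABLE_CLASS.fullmatch(expr) is not None
```

With this sketch, "[abc]", "[a-c]", "[^abc]" and "[-az]" are accepted, while
"[\w]" (an escape outside the listed ones) and "[a^b]" (an unescaped "^" that
is not the negation marker) are rejected.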

Best,

Felix
>
> Thanks,
> -yves
>
>
> -----Original Message-----
> From: Dave Lewis [mailto:dave.lewis@cs.tcd.ie]
> Sent: Sunday, February 03, 2013 2:01 PM
> To: public-multilingualweb-lt-comments@w3.org
> Subject: Re: [ACTION-385] Common regular expression syntax
>
> Hi Yves,
> In relation to ISSUE-67, as the original commenter, can you confirm you are satisfied with the resolution that we use the limited regex subset Shaun identified in response to ACTION-385 in the Allowed Characters data category, together with an accompanying note on best practice for Unicode normalisation that Shaun is addressing under ACTION-430 (Draft text explaining importance of Unicode normalization and best practices on ISSUE-67)?
>
> Regards,
> Dave
>
>
>
> On 27/01/2013 22:31, Yves Savourel wrote:
>> Hi Shaun,
>>
>> Thanks for the thorough analysis.
>> That should be enough for the goals of the data category.
>>
>> cheers,
>> -yves
>>
>>
>> -----Original Message-----
>> From: Shaun McCance [mailto:shaunm@gnome.org]
>> Sent: Sunday, January 27, 2013 10:30 AM
>> To: public-multilingualweb-lt@w3.org
>> Subject: [ACTION-385] Common regular expression syntax
>>
>> I've investigated features in six different regular expression dialects to try to find a safe common subset for the allowed characters data category. I tested Java, .NET, XSD, JavaScript, Perl, and Python. I still want to test POSIX EREs, and PHP may be good to test as well, given the focus on CMSs in 2.0. But I think the subset from the six I tested is going to be safe in general.
> [...]
>> So what I think this leaves us with is character classes [abc], ranges [a-c], and negations [^abc], where "^" and "]" must never appear unless backslash-escaped, "-" may be backslash-escaped or put at the beginning or end, the escape sequences "\n", "\r", "\t", "\d", and "\D" may be used, and literal "\" is escaped as "\\".
>>
>> Importantly, you must never have an unescaped backslash, because some dialects may treat it as the beginning of an escape sequence that means something special.
>>
>> This is a very limited subset, but I think it's what we have to use. I'm now going to try to make a portable RE that matches these portable RE character classes.
>>
>> Comments?
>>
>> --
>> Shaun
>>
Received on Sunday, 3 February 2013 21:41:15 GMT
