Re: [ACTION-385] Common regular expression syntax from Felix Sasaki on 2013-02-04 (public-multilingualweb-lt@w3.org from February 2013)

From: Felix Sasaki <fsasaki@w3.org>
Date: Mon, 04 Feb 2013 23:34:38 +0100
To: public-multilingualweb-lt@w3.org
Message-ID: <5110377E.7060609@w3.org>
Hi Shaun, all,

On 04.02.13 22:28, Shaun McCance wrote:
> On Mon, 2013-02-04 at 13:46 -0700, Yves Savourel wrote:
>> Hi Shaun,
>>
>> Many thanks for this Shaun.
>>
>> I've added it to our ITS processing to check the its:allowedCharacters
>> value and noticed that some of the test files have the expression "[^*
>> +]" which seems to be not valid based on this checking expression. (I
>> still have to make sure my validation code is right).
>>
>> Is that the case? If yes, how would we express "any chars but '*' and
>> '+'"?
> My mistake. It seems "^" maintains special meaning even when not
> at the beginning of the expression or a character class, so we
> have to escape it.
>
> ^(\.|
> \[\^?-?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E-&#x5B;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;&#x10000;-#x10FFFF;]|\\n|\\r|\\t|\\]|\\^|\\-|\\\\)(-([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E-&#x5B;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;&#x10000;-#x10FFFF;]|\\n|\\r|\\t|\\]|\\^|\\-|\\\\))?)+-?\])?$

Many thanks from me too. I tried the above regex in the ITS
schema, fixing some character escapes (e.g. "-#x10FFFF;" became
"-&#x10FFFF;"), and came up with the following:

^(\.|\[\^?-?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5B;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;&#x10000;-&#x10FFFF;]|\\n|\\r|\\t|\\]|\\^|\\-|\\\\)(-([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5B;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;&#x10000;-&#x10FFFF;]|\\n|\\r|\\t|\\]|\\^|\\-|\\\\))?)+-?\])?$

But this doesn't seem to work yet. I'm getting a hard-to-track
error: invalid regex, missing "]". Could you have another look?
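
For reference, here is a minimal Python sketch of how the two escape slips in the quoted pattern behave once the text goes through an XML parser. It is not the ITS schema tooling: the helper name expand_refs and the wrapper element <p> are made up for illustration, and it does not explain the missing "]" error above.

```python
import xml.etree.ElementTree as ET

def expand_refs(fragment):
    """Parse fragment as XML element content.

    Returns the text with character references expanded,
    or None if the fragment is not well-formed XML.
    """
    try:
        return ET.fromstring(f"<p>{fragment}</p>").text
    except ET.ParseError:
        return None

# Well-formed references expand to the characters themselves.
print(expand_refs("&#x2E;-&#x5B;"))
# A reference missing its semicolon makes the whole fragment ill-formed.
print(expand_refs("&#x2E-&#x5B;"))
# A reference missing its ampersand is silently kept as literal text.
print(expand_refs("-#x10FFFF;"))
```

So a missing ";" should be caught by the parser, while a missing "&" passes through and quietly changes what the character class contains.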

Thanks,

Felix

>
> ^(\.|
> \[\^?-?(([\x{09}\x{0A}\x{0D}\x{20}-\x{2C}\x{2E}-\x{5B}\x{5F}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]|\\n|\\r|\\t|\\]|\\^|\\-|\\\\)(-([\x{09}\x{0A}\x{0D}\x{20}-\x{2C}\x{2E}-\x{5B}\x{5F}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]|\\n|\\r|\\t|\\]|\\^|\\-|\\\\))?)+-?\])?$
>
> ^(\.|\[\^?-?(([\u0009\u000A\u000D\u0020-\u002C\u002E-\u005B\u005F-\uD7FF\uE000-\uFFFD]|\\n|\\r|\\t|\\]|\\^|\\-|\\\\)(-([\u0009\u000A\u000D\u0020-\u002C\u002E-\u005B\u005F-\uD7FF\uE000-\uFFFD]|\\n|\\r|\\t|\\]|\\^|\\-|\\\\))?)+-?\])?$
>
> re = new Regex("^(\\.|\\[\\^?-?(([\\u0009\\u000A\\u000D\\u0020-\\u002C\\u002E-\\u005B\\u005F-\\uD7FF\\uE000-\\uFFFD]|\\\\n|\\\\r|\\\\t|\\\\]|\\\\^|\\\\-|\\\\\\\\)(-([\\u0009\\u000A\\u000D\\u0020-\\u002C\\u002E-\\u005B\\u005F-\\uD7FF\\uE000-\\uFFFD]|\\\\n|\\\\r|\\\\t|\\\\]|\\\\^|\\\\-|\\\\\\\\))?)+-?\\])?$");
>
>> cheers,
>> -yves
>>
>> -----Original Message-----
>> From: Shaun McCance [mailto:shaunm@gnome.org]
>> Sent: Monday, February 04, 2013 11:53 AM
>> To: public-multilingualweb-lt@w3.org
>> Subject: Re: [ACTION-385] Common regular expression syntax
>>
>> On Sun, 2013-01-27 at 12:30 -0500, Shaun McCance wrote:
>>> So what I think this leaves us with is character classes [abc], ranges
>>> [a-c], and negations [^abc], where "^" and "]" must never appear
>>> unless backslash-escaped, "-" may be backslash-escaped or put at the
>>> beginning or end, the escape sequences "\n", "\r", "\t", "\d", and
>>> "\D" may be used, and literal "\" is escaped as "\\".
>>>
>>> Importantly, you must never have an unescaped backslash, because some
>>> dialects may treat it as the beginning of an escape sequence that
>>> means something special.
>>>
>>> This is a very limited subset, but I think it's what we have to use.
>>> I'm now going to try to make a portable RE that matches these portable
>>> RE character classes.
>> Upon further investigation, it seems some engines let \d match Unicode digits outside 0-9, so that's out too. There's an open question of which characters can be referred to. I decided to use the definition of Char in XML 1.0:
>>
>> http://www.w3.org/TR/REC-xml/#charsets
>>
>> It's hard to reference these, because many of the range boundary characters are unassigned, so effectively unprintable. I think we don't want to embed the literal character U+D7FF in the spec.
>>
>> Here is the proposed regular expression escaped with XML numeric character entities, as if it were put into an XML document:
>>
>> ^(\.|
>> \[^?-?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E-&#x5B;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;&#x10000;-#x10FFFF;]|\\n|\\r|\\t|\\]|\\^|\\-|\\\\)(-([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E-&#x5B;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;&#x10000;-#x10FFFF;]|\\n|\\r|\\t|\\]|\\^|\\-|\\\\))?)+-?\])?$
>>
>> (Email will almost certainly add line breaks. Ignore them.)
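
As a cross-check, the subset described above can be sketched in Python roughly as follows. This is an illustration under assumptions, not the proposed expression itself: it uses the escaped \^ form from the follow-up message, Python's \UXXXXXXXX spelling for characters beyond U+FFFF, and it groups the permitted escapes into one character class because Python's re treats a bare "^" outside a class as an anchor. The names CHAR, ESCAPE, ITEM, PORTABLE and is_portable are made up for the example.

```python
import re

# Literal characters allowed inside a class: XML 1.0 Char minus "-", "\", "]", "^".
CHAR = (r"[\u0009\u000A\u000D\u0020-\u002C\u002E-\u005B"
        r"\u005F-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]")
# Permitted escape sequences: \n \r \t \] \^ \- \\
ESCAPE = r"\\[nrt\]\^\-\\]"
ITEM = f"(?:{CHAR}|{ESCAPE})"
# Either nothing, a lone ".", or a bracketed class with optional negation,
# optional leading/trailing "-", and one or more items or ranges.
PORTABLE = re.compile(rf"(?:\.|\[\^?-?(?:{ITEM}(?:-{ITEM})?)+-?\])?")

def is_portable(expr: str) -> bool:
    """True if expr falls inside the portable character-class subset."""
    return PORTABLE.fullmatch(expr) is not None

for candidate in ["[abc]", "[a-c]", "[^*+]", ".", "[a^]", r"[a\^]", r"[\d]", "[]"]:
    print(repr(candidate), is_portable(candidate))
```

Using fullmatch stands in for the ^ and $ anchors, which not every engine (XSD, for instance) treats as anchors.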
>>
>> There are two ways I know of to escape characters (not bytes) in different engines: \x{2234} and \u2234. The \u syntax can only reference characters in the Basic Multilingual Plane (Plane 0), and works in everything except XSD and Perl/PCRE. The \x{} syntax is only Perl/PCRE, but can specify any character.
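
To illustrate the four-digit limit of \u, here is a small sketch using Python's re module. Python is only one example engine and is an assumption here. It accepts \uXXXX and additionally offers \UXXXXXXXX for code points above U+FFFF, which is not one of the two syntaxes discussed above.

```python
import re

bmp = re.compile(r"[\u0020-\u002C]")             # four hex digits: fine
astral = re.compile(r"[\U00010000-\U0010FFFF]")  # Python-specific eight-digit form
# \u consumes exactly four hex digits, so this is U+1000 followed by a literal "0".
four_digits_only = re.compile(r"\u10000")

print(bool(bmp.fullmatch("*")))
print(bool(astral.fullmatch("\U0001F600")))
print(bool(four_digits_only.fullmatch("\u1000" + "0")))
print(bool(four_digits_only.fullmatch("\U00010000")))
```

The last two lines show why writing "\u10000" does not give you U+10000: the pattern matches the two-character string U+1000 plus "0" instead.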
>>
>> Here it is with \x{}, for Perl/PCRE only:
>>
>> ^(\.|
>> \[^?-?(([\x{09}\x{0A}\x{0D}\x{20}-\x{2C}\x{2E}-\x{5B}\x{5F}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]|\\n|\\r|\\t|\\]|\\^|\\-|\\\\)(-([\x{09}\x{0A}\x{0D}\x{20}-\x{2C}\x{2E}-\x{5B}\x{5F}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]|\\n|\\r|\\t|\\]|\\^|\\-|\\\\))?)+-?\])?$
>>
>> And here is a regular expression that matches a subset of our subset, limited to the Basic Multilingual Plane (Plane 0), with the \u escape:
>>
>> ^(\.|\[^?-?(([\u0009\u000A\u000D\u0020-\u002C\u002E-\u005B\u005F-\uD7FF\uE000-\uFFFD]|\\n|\\r|\\t|\\]|\\^|\\-|\\\\)(-([\u0009\u000A\u000D\u0020-\u002C\u002E-\u005B\u005F-\uD7FF\uE000-\uFFFD]|\\n|\\r|\\t|\\]|\\^|\\-|\\\\))?)+-?\])?$
>>
>> And remember, the backslashes and escaped backslashes are significant to the regular expression engine. If you're putting that into a string in a language like Java or C#, you need to escape the escapes:
>>
>> re = new Regex("^(\\.|\\[^?-?(([\\u0009\\u000A\\u000D\\u0020-\\u002C\\u002E-\\u005B\\u005F-\\uD7FF\\uE000-\\uFFFD]|\\\\n|\\\\r|\\\\t|\\\\]|\\\\^|\\\\-|\\\\\\\\)(-([\\u0009\\u000A\\u000D\\u0020-\\u002C\\u002E-\\u005B\\u005F-\\uD7FF\\uE000-\\uFFFD]|\\\\n|\\\\r|\\\\t|\\\\]|\\\\^|\\\\-|\\\\\\\\))?)+-?\\])?$");
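
The same "escape the escapes" point can be shown in Python, used here only as a stand-in for Java or C#. The variable names are made up. A raw string holds the regular expression source as written, while an ordinary string literal needs every backslash doubled to end up with the same characters.

```python
import re

raw = r"\\n"        # regex source as written: backslash, backslash, n
escaped = "\\\\n"   # the same three characters in an ordinary string literal

print(raw == escaped, len(raw))

# The regex \\n matches a literal backslash followed by "n" ...
print(bool(re.fullmatch(raw, "\\n")))
# ... and does not match a newline character.
print(bool(re.fullmatch(raw, "\n")))
```

Forgetting one level of escaping changes the meaning silently: the regex source \n would match a newline instead of the two-character sequence.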
>>
>> --
>> Shaun
>>
>>
>>
>>
>>
>
>
Received on Monday, 4 February 2013 22:35:02 UTC