Re: [Issue-67] [Action-385] Work on regex for validating regex subset proposal from Felix Sasaki on 2013-04-06 (public-multilingualweb-lt@w3.org from April 2013)

From: Felix Sasaki <fsasaki@w3.org>
Date: Sat, 06 Apr 2013 22:45:49 +0200
To: Pablo Nieto Caride <pablo.nieto@linguaserve.com>
CC: 'Jirka Kosek' <jirka@kosek.cz>, public-multilingualweb-lt@w3.org
Message-ID: <5160897D.1070809@w3.org>
Hi Pablo, all,

I had a look at the test suite again and found these kinds of regexes:

[a-zA-Z_\-]
[^*+]
[ &#xFF01;–&#xFF5E;]
[&#x0020;-&#x00FE;]
[^*+]

Maybe it would help to take the ABNF approach that Pablo mentioned and 
restrict ourselves to that. See the ABNF below.

========
allowedCharacters = start 1*range end ["+"]

start = "["

end = "]"

range = char / char "-" char

char = [neg] BMP+escapes

neg = "^"

========

This means: the regex must always start with "[" and end with "]", 
optionally followed by "+". Inside the brackets there must be at least 
one range. A range is either a single character or a span of the form 
character "-" character. Each character is a "char", which can 
optionally be negated via "^". BMP+escapes is the Unicode BMP, including 
the escaped forms of characters like "[", "]", "-" etc.
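
To make the ABNF concrete, here is a minimal Python sketch of a checker for it. It is only an illustration under stated assumptions, not the XML Schema / RELAX NG regex itself: the names is_allowed_characters and ALLOWED_CHARACTERS are made up for this example, and I assume that "[", "]", "-", "^" and "\" are the characters that must be backslash-escaped inside the brackets, which the ABNF does not spell out. The sketch also reads "[neg]" literally, i.e. "^" may precede any char.

```python
import re

# Assumption: "[", "]", "-", "^" and "\" must be backslash-escaped inside
# the brackets; any other BMP character (U+0000..U+FFFF) stands for itself.
LITERAL = r"[^\[\]\-\^\\\U00010000-\U0010FFFF]"
ESCAPE = r"\\[\[\]\-\^\\]"

CHAR = rf"\^?(?:{LITERAL}|{ESCAPE})"   # char  = [neg] BMP+escapes
RANGE = rf"{CHAR}(?:-{CHAR})?"         # range = char / char "-" char

# allowedCharacters = start 1*range end ["+"]
ALLOWED_CHARACTERS = re.compile(rf"\[(?:{RANGE})+\]\+?")


def is_allowed_characters(value: str) -> bool:
    """Return True if the whole value matches the ABNF sketch above."""
    return ALLOWED_CHARACTERS.fullmatch(value) is not None
```

With these assumptions the test suite examples [a-zA-Z_\-] and [^*+] are accepted, while broken input such as "[f-" or "[]" is rejected.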

This is more restricted than what Shaun proposed at
http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2013Jan/0180.html
but looking at the test suite and the use case of allowed characters it 
seems to cover everything.

Using the ABNF would not mean dropping the regex. I started working on an 
XML Schema / RELAX NG regex implementing the above ABNF, and it looks pretty 
straightforward.
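
One point to keep in mind for the XML Schema / RELAX NG version, given the problems with ^ and $ that Pablo reports in the quoted message below: XML Schema regular expressions are implicitly anchored at both ends, and ^ and $ outside a character class are ordinary characters there, not anchors. That is a likely explanation for the validation errors, though I have not verified it against jing. The Python sketch below only emulates that behaviour for a leading ^ and a trailing $; the function names and the toy pattern are assumptions for illustration, not a real XSD engine and not Pablo's full regex.

```python
import re


def perl_style_match(pattern: str, value: str) -> bool:
    """Perl/PCRE-like semantics: ^ and $ anchor the match."""
    return re.search(pattern, value) is not None


def xsd_style_match(pattern: str, value: str) -> bool:
    """Rough emulation (assumption, not a real XSD engine): the whole value
    must match, and a leading ^ or trailing $ is an ordinary character."""
    if pattern.startswith("^"):
        pattern = r"\^" + pattern[1:]
    if pattern.endswith("$") and not pattern.endswith(r"\$"):
        pattern = pattern[:-1] + r"\$"
    return re.fullmatch(pattern, value) is not None


# Toy patterns for a bracketed class, with and without explicit anchors.
ANCHORED = r"^\[\^?[*+a-z]+\]$"
UNANCHORED = r"\[\^?[*+a-z]+\]"

perl_ok = perl_style_match(ANCHORED, "[^*+]")           # anchors behave as anchors
xsd_anchored_ok = xsd_style_match(ANCHORED, "[^*+]")    # ^ and $ taken literally
xsd_unanchored_ok = xsd_style_match(UNANCHORED, "[^*+]")
```

Under this emulation the anchored pattern accepts [^*+] with Perl-style semantics but rejects it with XSD-style semantics, whereas the pattern without ^ and $ accepts it. So dropping the explicit anchors from the "pattern" param may be all that is needed.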

Thoughts?

Best,

Felix


Am 06.04.13 19:18, schrieb Pablo Nieto Caride:
> Hi Felix, all,
>
>
> On Apr 6, 2013, at 1:25 PM, Felix Sasaki <fsasaki@w3.org> wrote:
>
>> Hi Pablo,
>>
>> sorry for the effort, but to move this forward we need to make sure that at least the test suite regex examples work.
>>
>> I checked
>> https://github.com/finnle/ITS-2.0-Testsuite/tree/master/its2.0/inputdata/allowedcharacters/xml
>> by replacing in my local copy of the test suite
>> https://github.com/finnle/ITS-2.0-Testsuite/blob/master/its2.0/schema/its20-types.rng
>> this part
>>   <define name="its-allowedCharacters.type">
>>     <data type="string"></data>
>>   </define>
>> with this, i.e. with your regex for XML validation inside the "pattern" param:
>>     <data type="string">
>>       <param name="pattern">^((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)?([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)*\]))*$</param>
>>     </data>
>>
>> That gave me validation errors like this one:
>>
>> [jing] /its2.0/inputdata/allowedcharacters/xml/allowedcharacters7xmlrules.xml:3:100: error: Bad value "[^*+]" for attribute "allowedCharacters" on element "allowedCharactersRule" from namespace "http://www.w3.org/2005/11/its".
>>
>> Could you change in your local copy of the test suite the "param" element with your regex so that the validation for all test suite files for allowed characters
>> https://github.com/finnle/ITS-2.0-Testsuite/tree/master/its2.0/inputdata/allowedcharacters/
>> works?
>>
>> FYI, the content is an XML Schema regular expression, so your XML version for validation should work finally, I think.
>>
>> Again, sorry for the effort, but it would be great to have this done before the next publication, that is by Thursday next week. Would that work for you?
>>
> I'm doing some testing with the files you sent me to see how XSD works with regex, and I'm seeing weird things, like problems with ^ and $ to set the beginning and end of the regex. I'm still working on it.
>
> I will do as you say and change my local copy of the schema to validate the Test Suite files.
>
> Yes, I think it'll work for me; there is time and I think I'm close to the solution. Sorry, but it got more complicated than I initially expected.
>
> Cheers,
> Pablo.
>
>> Best,
>>
>> Felix
>>
>> Am 05.04.13 15:39, schrieb Pablo Nieto Caride:
>>> Hi Felix,
>>>
>>> Yes, I tried the Allowed Characters test suite example before to make sure that the regex worked, and [a-zA-Z_\-] works for me on my system. Anyway, I'll try what you suggest and get back to you as soon as I have the results.
>>>
>>> Cheers,
>>> Pablo.
>>> __________________________________
>>>
>>> Hi Pablo, all,
>>>
>>> Am 05.04.13 11:24, schrieb Pablo Nieto Caride:
>>>> Hi all,
>>>>
>>>> I have completed the regex. In the end I decided to restrict it to Plane 0 (Basic Multilingual Plane, 0000-FFFF) because I think it is sufficient and the regex would otherwise be very complex. Besides, Shaun didn't actually limit it to Plane 1 (Supplementary Multilingual Plane, 10000–1FFFF) but allowed everything up to Planes 15-16 (10FFFF), which is too much. As I understand it, the regex covers the basics (now including escapes of [, ], ^ and -), does not match incorrect regexes such as "[f-", supports the greedy and lazy wildcard (this is not really necessary), and does not support nested character classes (do we need them? They are rarely used in general). Please test it:
>>>> 1) Here is the proposed regular expression escaped with XML numeric character entities, as if it were put into an XML document:
>>>> ^((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)?([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)*\]))*$
>>> I tried that with [a-zA-Z_\-]
>>> but got a validation error. Could you check a few examples from https://github.com/finnle/ITS-2.0-Testsuite/blob/master/its2.0/inputdata/allowedcharacters/html/
>>> to make sure that the regex works? E.g. by creating a schema like the attached one and checking with the regex?
>>>
>>>
>>> Best,
>>>
>>> Felix
>>>> 2) Here it is with \x{}, for Perl/PCRE only:
>>>> ^((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([\x{09}\x{0A}\x{0D}\x{20}-\x{2C}\x{2E}-\x{5A}\x{5F}-\x{D7FF}\x{E000}-\x{FFFD}]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)?([\x{09}\x{0A}\x{0D}\x{20}-\x{2C}\x{2E}-\x{5A}\x{5F}-\x{D7FF}\x{E000}-\x{FFFD}]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)*\]))*$
>>>>
>>>> 3) And here is a regular expression that matches a subset of our subset, limited to Plane 0, with the \u escape:
>>>> ^((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([\u0009\u000A\u000D\u0020-\u002C\u002E-\u005A\u005F-\uD7FF\uE000-\uFFFD]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)?([\u0009\u000A\u000D\u0020-\u002C\u002E-\u005A\u005F-\uD7FF\uE000-\uFFFD]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)*\]))*$
>>>>
>>>> 4) And remember, the backslashes and escaped backslashes are significant to the regular expression engine. If you're putting that into a string in a language like Java or C#, you need to escape the escapes:
>>>> re = new Regex("^((\\.((\\*|\\+)|(\\*\\?|\\+\\?))?)|(\\[\\^?(([\\u0009\\u000A\\u000D\\u0020-\\u002C\\u002E-\\u005A\\u005F-\\uD7FF\\uE000-\\uFFFD]|(\\\\\\[)|(\\\\\\])|(\\\\\\^)|(\\\\\\-)|(\\\\))+(-)?([\\u0009\\u000A\\u000D\\u0020-\\u002C\\u002E-\\u005A\\u005F-\\uD7FF\\uE000-\\uFFFD]|(\\\\\\[)|(\\\\\\])|(\\\\\\^)|(\\\\\\-)|(\\\\))+)*\\]))*$");
>>>>
>>>> I'll now proceed to draft text explaining the importance of Unicode normalization and best practices; that's Action-430.
>>>>
>>>> Cheers,
>>>> Pablo.
>>>> __________________________________
>>>>
>>>> Hi Jirka,
>>>>
>>>> It should not match invalid expressions, since it only supports character classes, ranges and negations, but it still needs a bit of polishing regarding escapes. I don't think we need a BNF grammar, but it's not mine to decide; I'm just doing what I'm supposed to.
>>>>
>>>> Cheers,
>>>> Pablo.
>>>> __________________________________
>>>>
>>>> On 4.4.2013 17:12, Pablo Nieto Caride wrote:
>>>>> Please, implementers and whoever is interested, give feedback if
>>>>> necessary so I can move forward and evolve the regex.
>>>> Hi,
>>>>
>>>> since such complex regular expressions are mostly write-only (it's very hard to understand what they are trying to match), I'm not sure what the point is of having this complex regular expression for checking our regular expression syntax subset. I haven't tried to get a deep understanding of this expression, but I bet it will match even invalid expressions. If we want a rigorous definition of our RE syntax, we should provide it as a grammar written in BNF.
>>>>
>>>>      Jirka
>>>>
>>>> --
>>>> ------------------------------------------------------------------
>>>>     Jirka Kosek      e-mail: jirka@kosek.cz      http://xmlguru.cz
>>>> ------------------------------------------------------------------
>>>>          Professional XML consulting and training services
>>>>     DocBook customization, custom XSLT/XSL-FO document processing
>>>> ------------------------------------------------------------------
>>>>    OASIS DocBook TC member, W3C Invited Expert, ISO JTC1/SC34 rep.
>>>> ------------------------------------------------------------------
>>>>       Bringing you XML Prague conference    http://xmlprague.cz
>>>> ------------------------------------------------------------------
>>>>
>>>>
>>>>
>>>>
>>
>
Received on Saturday, 6 April 2013 20:46:27 UTC