Re: [Issue-67] [Action-385] Work on regex for validating regex subset proposal from Felix Sasaki on 2013-04-08 (public-multilingualweb-lt@w3.org from April 2013)

From: Felix Sasaki <fsasaki@w3.org>
Date: Mon, 08 Apr 2013 17:01:38 +0200
To: Pablo Nieto Caride <pablo.nieto@linguaserve.com>
CC: 'Jirka Kosek' <jirka@kosek.cz>, public-multilingualweb-lt@w3.org
Message-ID: <5162DBD2.2060305@w3.org>
Thanks a lot, Pablo. I think your regex allows things that would be 
forbidden by the ABNF I had proposed (e.g. starting without "["), so 
maybe it is better not to have the ABNF. Otherwise users might be 
confused. How about closing issue-67 by putting your regex into the 
schema and changing the allowed characters like this:

- drop reference to XML Schema regex, as suggested in the original 
comment from Yves?
- have the list of allowed items, as suggested by Shaun at 
http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2013Jan/0180.html

If people agree I'm happy to make the edit, see
https://www.w3.org/International/multilingualweb/lt/track/actions/482

Best,

Felix

On 08.04.13 12:59, Pablo Nieto Caride wrote:
> Sorry! I forgot to add support for \n, \r, \t, \s, etc. Here is the corrected regex:
> ((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\])*|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\])*)*|(\\w|\\n|\\r|\\t|\\s)*
>
> And here is the complex one, corrected:
> ((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|)+)+\])|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\]))+)+\])*|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|)+)+\])|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\]))+(-)([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|)+)+\])|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\]))+)+\])*)*|(\\w|\\n|\\r|\\t|\\s)*
>
> Cheers,
> Pablo.
> __________________________________
>
> -----Original Message-----
> From: Pablo Nieto Caride [mailto:pablo.nieto@linguaserve.com]
> Sent: Monday, 08 April 2013 12:45
> To: 'Felix Sasaki'; 'Jirka Kosek'
> CC: public-multilingualweb-lt@w3.org
> Subject: RE: [Issue-67] [Action-385] Work on regex for validating regex subset proposal
>
> Hi Felix, all,
>
> The ABNF does not seem to be a bad approach. In any case, I have reworked the regex (the markers ^ and $ at the beginning and the end do not seem to work with XSD) and now it's OK. I did what Felix suggested: I changed my its20-types.rng, ran ant validate-xml, and it worked. Here are the changes and the new regex.
>    <define name="its-allowedCharacters.type">
>      <data type="string">
>   <param name="pattern">((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\])*|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\])*)*|(\\w)*</param>
>      </data>
>    </define>
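
On the ^ and $ problem mentioned above: XML Schema regular expressions are implicitly anchored at both ends, and ^ and $ are not treated as anchors there. A rough analogy in Python is re.fullmatch. The sketch below only illustrates the anchoring behaviour; Python's re module is not an XSD regex engine, and the helper names strip_perl_anchors and xsd_style_match are made up for this illustration.

```python
import re


def strip_perl_anchors(pattern: str) -> str:
    """Drop a leading ^ and an unescaped trailing $ from a Perl-style pattern.

    Hypothetical helper: useful when porting an anchored Perl/PCRE regex
    to a dialect that anchors implicitly, such as XSD patterns.
    """
    if pattern.startswith("^"):
        pattern = pattern[1:]
    if pattern.endswith("$"):
        body = pattern[:-1]
        # Count the backslashes directly before the "$"; an odd count
        # means the "$" is escaped and must be kept.
        trailing_backslashes = len(body) - len(body.rstrip("\\"))
        if trailing_backslashes % 2 == 0:
            pattern = body
    return pattern


def xsd_style_match(pattern: str, value: str) -> bool:
    """Match the whole value, as an implicitly anchored pattern would."""
    return re.fullmatch(pattern, value) is not None
```

With this, strip_perl_anchors("^[a-z]+$") gives "[a-z]+", and xsd_style_match("[a-z]+", "abc1") is False because the whole value has to match, not just a prefix.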
>
> It covers everything except nested character classes such as [a-d[^c]], which are not widely used and which most engines do not support. If we go with the previous regex, I imagine we would have to drop the following examples from the specification:
> "[&#x20;-&#x1ffff;-[&lt;>:&quot;\\/|\?*]]" : allows only the characters valid for Windows file names.
> and
> "[a-&#x00ff;-[\s]]" : allows all characters between U+0061 and U+00FF except the characters SPACE (U+0020), TABULATION (U+0009), CARRIAGE RETURN (U+000D) and LINE FEED (U+000A).
> Otherwise, here is a regex that covers everything, but it is huge:
> ((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|)+)+\])|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\]))+)+\])*|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|)+)+\])|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\]))+(-)([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|)+)+\])|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\]))+)+\])*)*|(\\w)*
>
> And one last thing: ranges such as [a-f-[z]] do not seem very meaningful, since [a-z] is the same and better.
>
> By the way, Jirka, when trying to validate the files, the jing.jar of the Test Suite repository didn't work for me; I had to copy the one from your repository html5-its-tools. Can anyone confirm this?
>
> Cheers,
> Pablo.
> __________________________________
>
> On 06.04.13 22:45, Felix Sasaki wrote:
>> Hi Pablo, all,
>>
>> I had a look at the test suite again and found these kinds of regexes:
>>
>> [a-zA-Z_\-]
>> [^*+]
>> [ &#xFF01;–&#xFF5E;]
>> [&#x0020;-&#x00FE;]
>> [^*+]
>>
>> Maybe it would help to do the ABNF approach that Pablo mentioned
> Ups, sorry, I meant "that Jirka mentioned".
>
> - Felix
>
>> and restrict us with that. See an ABNF below.
>>
>> ========
>> allowedCharacters = start 1*range end ["+"]
>>
>> start = "["
>>
>> end = "]"
>>
>> range = char / char "-" char
>>
>> char = [neg] BMP+escapes
>>
>> neg = "^"
>>
>> ========
>>
>> This means: the regex must always start with "[" and end with "]",
>> optionally followed by "+". In the brackets there must be at least one
>> range. A range is either a single character or has the form
>> character "-" character. Each character is a "char", which can
>> optionally be negated via "^". BMP+escapes then is the Unicode BMP,
>> including the escapes of characters like "[", "]", "-" etc.
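
As a rough illustration of the ABNF above, here is a minimal Python sketch that translates it rule by rule into a regex. It is not the regex proposed for the schema. The names ESCAPE, LITERAL, CHAR, RANGE and is_allowed_characters are made up for this example, the BMP literal ranges are borrowed from Pablo's regex in this thread, and the set of escapable characters ([, ], ^, - and the backslash) is an assumption.

```python
import re

# BMP+escapes, part 1: a backslash followed by one of [ ] ^ - \
ESCAPE = r"\\[\[\]\^\-\\]"

# BMP+escapes, part 2: any other BMP character that needs no escape.
# The lookahead excludes the characters that must be escaped.
LITERAL = (
    r"(?![\[\]\^\-\\])"
    r"[\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD]"
)

# char = [neg] BMP+escapes
CHAR = rf"\^?(?:{ESCAPE}|{LITERAL})"

# range = char / char "-" char
RANGE = rf"{CHAR}(?:-{CHAR})?"

# allowedCharacters = start 1*range end ["+"]
ALLOWED = re.compile(rf"\[(?:{RANGE})+\]\+?")


def is_allowed_characters(value: str) -> bool:
    """Return True if the whole value conforms to the sketched grammar."""
    return ALLOWED.fullmatch(value) is not None
```

Under these assumptions the test suite values [a-zA-Z_\-] and [^*+] are accepted, while an unterminated value such as "[f-" or a value without brackets such as "a-z" is rejected.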
>>
>> This is more restricted than what Shaun proposed at
>> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2013Jan/0180.html
>>
>> but looking at the test suite and the use case of allowed characters
>> it seems to cover everything.
>>
>> Using the ABNF would not mean dropping the regex. I started working on an
>> XML Schema / RELAX NG regex implementing the above ABNF, and it looks
>> pretty straightforward.
>>
>> Thoughts?
>>
>> Best,
>>
>> Felix
>>
>>
>> On 06.04.13 19:18, Pablo Nieto Caride wrote:
>>> Hi Felix, all,
>>>
>>>
>>> On Apr 6, 2013, at 1:25 PM, Felix Sasaki <fsasaki@w3.org> wrote:
>>>
>>>> Hi Pablo,
>>>>
>>>> sorry for the effort, but to move this forward, we need to make
>>>> sure that at least the test suite regex examples work.
>>>>
>>>> I checked
>>>> https://github.com/finnle/ITS-2.0-Testsuite/tree/master/its2.0/inputdata/allowedcharacters/xml
>>>>
>>>> by replacing in my local copy of the test suite
>>>> https://github.com/finnle/ITS-2.0-Testsuite/blob/master/its2.0/schema/its20-types.rng
>>>>
>>>> this part
>>>>    <define name="its-allowedCharacters.type">
>>>>      <data type="string"></data>
>>>>    </define>
>>>> with this, that is, with your regex for XML validation inside the
>>>> "param" element named "pattern":
>>>>      <data type="string">
>>>>        <param
>>>> name="pattern">^((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)?([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)*\]))*$</param>
>>>>      </data>
>>>>
>>>> That gave me validation errors like this one:
>>>>
>>>> [jing]
>>>> /its2.0/inputdata/allowedcharacters/xml/allowedcharacters7xmlrules.xml:3:100:
>>>> error: Bad value “[^*+]” for attribute “allowedCharacters” on
>>>> element “allowedCharactersRule” from namespace
>>>> “http://www.w3.org/2005/11/its”.
>>>>
>>>> Could you change in your local copy of the test suite the "param"
>>>> element with your regex so that the validation for all test suite
>>>> files for allowed characters
>>>> https://github.com/finnle/ITS-2.0-Testsuite/tree/master/its2.0/inputdata/allowedcharacters/
>>>>
>>>> works?
>>>>
>>>> FYI, the content is an XML Schema regular expression, so your XML
>>>> version for validation should work finally, I think.
>>>>
>>>> Again, sorry for the effort, but it would be great to have this done
>>>> before the next publication, that is by Thursday next week. Would
>>>> that work for you?
>>>>
>>> I am doing some testing with the files you sent me to see how XSD works
>>> with regex and I'm seeing weird things, like problems with ^ and $ to
>>> set the beginning and end of the regex. I'm still working on it.
>>>
>>> I will do as you say and change my local copy of the schema to
>>> validate the Test Suite files.
>>>
>>> Yes I think it'll work for me, there is time and I think I'm close to
>>> the solution. Sorry but it got more complicated than I initially
>>> expected.
>>>
>>> Cheers,
>>> Pablo.
>>>
>>>> Best,
>>>>
>>>> Felix
>>>>
>>>> On 05.04.13 15:39, Pablo Nieto Caride wrote:
>>>>> Hi Felix,
>>>>>
>>>>> Yes, I tried the Allowed Characters Test Suite example before to make
>>>>> sure that the regex worked, and [a-zA-Z_\-] works for me on my
>>>>> system. Anyway, I'll try what you suggest and get back to you as
>>>>> soon as I have the results.
>>>>>
>>>>> Cheers,
>>>>> Pablo.
>>>>> __________________________________
>>>>>
>>>>> Hi Pablo, all,
>>>>>
>>>>> On 05.04.13 11:24, Pablo Nieto Caride wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> I have completed the regex. In the end I decided to restrict it to
>>>>>> Plane 0 (Basic Multilingual Plane, 0000-FFFF) because I think it is
>>>>>> sufficient and otherwise the regex would be very complex. Besides,
>>>>>> Shaun didn't actually limit it to Plane 1 (Supplementary
>>>>>> Multilingual Plane, 10000–1FFFF) but to Planes 15-16 (10FFFF),
>>>>>> which is too much. As I understand it, the regex covers the basics
>>>>>> (now including escapes of [, ], ^ and -), does not match incorrect
>>>>>> regexes such as "[f-", supports the greedy and lazy wildcards (this
>>>>>> is not really necessary), and does not support nested character
>>>>>> classes (do we need them? They are rarely used in general). Please test it:
>>>>>> 1) Here is the proposed regular expression escaped with XML
>>>>>> numeric character entities, as if it were put into an XML document:
>>>>>> ^((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)?([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)*\]))*$
>>>>> I tried that with an [a-zA-Z_\-]
>>>>> but got a validation error. Could you check a few examples from
>>>>> https://github.com/finnle/ITS-2.0-Testsuite/blob/master/its2.0/inputdata/allowedcharacters/html/
>>>>> to make sure that the regex works? E.g. by creating a schema like
>>>>> the attached one and checking with the regex?
>>>>>
>>>>>
>>>>> Best,
>>>>>
>>>>> Felix
>>>>>> 2) Here it is with \x{}, for Perl/PCRE only:
>>>>>> ^((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([\x{09}\x{0A}\x{0D}\x{20}-\x{2C}\x{2E}-\x{5A}\x{5F}-\x{D7FF}\x{E000}-\x{FFFD}]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)?([\x{09}\x{0A}\x{0D}\x{20}-\x{2C}\x{2E}-\x{5A}\x{5F}-\x{D7FF}\x{E000}-\x{FFFD}]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)*\]))*$
>>>>>>
>>>>>>
>>>>>> 3) And here is a regular expression that matches a subset of our
>>>>>> subset, limited to Plane 0, with the \u escape:
>>>>>> ^((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([\u0009\u000A\u000D\u0020-\u002C\u002E-\u005A\u005F-\uD7FF\uE000-\uFFFD]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)?([\u0009\u000A\u000D\u0020-\u002C\u002E-\u005A\u005F-\uD7FF\uE000-\uFFFD]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)*\]))*$
>>>>>>
>>>>>>
>>>>>> 4) And remember, the backslashes and escaped backslashes are
>>>>>> significant to the regular expression engine. If you're putting
>>>>>> that into a string in a language like Java or C#, you need to
>>>>>> escape the escapes:
>>>>>> re = new Regex("^((\\.((\\*|\\+)|(\\*\\?|\\+\\?))?)|(\\[\\^?(([\\u0009\\u000A\\u000D\\u0020-\\u002C\\u002E-\\u005A\\u005F-\\uD7FF\\uE000-\\uFFFD]|(\\\\\\[)|(\\\\\\])|(\\\\\\^)|(\\\\\\-)|(\\\\))+(-)?([\\u0009\\u000A\\u000D\\u0020-\\u002C\\u002E-\\u005A\\u005F-\\uD7FF\\uE000-\\uFFFD]|(\\\\\\[)|(\\\\\\])|(\\\\\\^)|(\\\\\\-)|(\\\\))+)*\\]))*$");
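
To illustrate point 4 in a language that offers both options, here is a small Python sketch using one fragment of the regex, (\\\[), which matches a backslash followed by "[". A raw string literal keeps the backslashes exactly as the regex engine should see them, while an ordinary string literal needs every backslash doubled, as in the Java/C# form quoted above. The names RAW, DOUBLED and matches_escaped_bracket are made up for this example.

```python
import re

# The fragment as the regex engine should see it: an escaped backslash
# followed by an escaped "[" inside a capturing group.
RAW = r"(\\\[)"

# The same fragment in an ordinary string literal: each of the three
# backslashes is doubled, giving six backslashes in the source code.
DOUBLED = "(\\\\\\[)"


def matches_escaped_bracket(text: str) -> bool:
    """Return True if text is exactly a backslash followed by "["."""
    return re.fullmatch(RAW, text) is not None
```

Both literals produce the same pattern string, so RAW == DOUBLED holds, and the pattern matches the two-character text made of a backslash and "[" but not a bare "[".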
>>>>>>
>>>>>> I'll proceed now to draft text explaining importance of Unicode
>>>>>> normalization and best practices, that's Action-430.
>>>>>>
>>>>>> Cheers,
>>>>>> Pablo.
>>>>>> __________________________________
>>>>>>
>>>>>> Hi Jirka,
>>>>>>
>>>>>> It should not match invalid expressions since it only supports
>>>>>> character classes, ranges and negations, but it still needs a bit of
>>>>>> polishing regarding escapes. I don't think we need a BNF grammar,
>>>>>> but it's not mine to decide; I am just doing what I'm supposed to.
>>>>>>
>>>>>> Cheers,
>>>>>> Pablo.
>>>>>> __________________________________
>>>>>>
>>>>>> On 4.4.2013 17:12, Pablo Nieto Caride wrote:
>>>>>>> Please, implementers and whoever that is interested, give
>>>>>>> feedback if necessary so I can move forward and evolve the regex.
>>>>>> Hi,
>>>>>>
>>>>>> since such complex regular expressions are mostly write-only (it's
>>>>>> very hard to understand what they are trying to match), I'm not
>>>>>> sure what the point is of having this complex regular expression
>>>>>> for checking our regular expression syntax subset. I haven't tried
>>>>>> to get a deep understanding of this expression, but I bet it will
>>>>>> match even invalid expressions. If we want to have a rigorous
>>>>>> definition of our RE syntax, we should provide its definition as a
>>>>>> grammar written in BNF.
>>>>>>
>>>>>>                      Jirka
>>>>>>
>>>>>> --
>>>>>> ------------------------------------------------------------------
>>>>>>      Jirka Kosek      e-mail: jirka@kosek.cz http://xmlguru.cz
>>>>>> ------------------------------------------------------------------
>>>>>>           Professional XML consulting and training services
>>>>>>      DocBook customization, custom XSLT/XSL-FO document processing
>>>>>> ------------------------------------------------------------------
>>>>>>     OASIS DocBook TC member, W3C Invited Expert, ISO JTC1/SC34 rep.
>>>>>> ------------------------------------------------------------------
>>>>>>        Bringing you XML Prague conference http://xmlprague.cz
>>>>>> ------------------------------------------------------------------
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>
>
>
Received on Monday, 8 April 2013 15:02:20 UTC