- From: Felix Sasaki <fsasaki@w3.org>
- Date: Mon, 08 Apr 2013 17:01:38 +0200
- To: Pablo Nieto Caride <pablo.nieto@linguaserve.com>
- CC: 'Jirka Kosek' <jirka@kosek.cz>, public-multilingualweb-lt@w3.org
Thanks a lot, Pablo.

I think your regex allows things that would be forbidden by the ABNF I had proposed (e.g. starting without "["), so maybe it is better not to have the ABNF. Otherwise users might be confused.

How about closing issue-67 by putting your regex into the schema and changing the allowed characters like this:

- drop the reference to XML Schema regex, as suggested in the original comment from Yves?
- have the list of allowed items, as suggested by Shaun at
  http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2013Jan/0180.html

If people agree I'm happy to make the edit, see
https://www.w3.org/International/multilingualweb/lt/track/actions/482

Best,

Felix

On 08.04.13 12:59, Pablo Nieto Caride wrote:
> Sorry! I forgot to add support for \n \r \t \s etc. Here is the corrected regex:
> ((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\])*|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\])*)*|(\\w|\\n|\\r|\\t|\\s)*
>
> And here is the complex one, corrected:
> ((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|)+)+\])|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\]))+)+\])*|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|)+)+\])|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\]))+(-)([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|)+)+\])|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\]))+)+\])*)*|(\\w|\\n|\\r|\\t|\\s)*
>
> Cheers,
> Pablo.
> __________________________________
>
> -----Original Message-----
> From: Pablo Nieto Caride [mailto:pablo.nieto@linguaserve.com]
> Sent: Monday, 8 April 2013 12:45
> To: 'Felix Sasaki'; 'Jirka Kosek'
> CC: public-multilingualweb-lt@w3.org
> Subject: RE: [Issue-67] [Action-385] Work on regex for validating regex subset proposal
>
> Hi Felix, all,
>
> The ABNF does not seem to be a bad approach. In any case, I have reworked the regex (the markers ^ and $ at the beginning and the end do not seem to work with XSD) and now it's OK. I did what Felix suggested: I changed my its20-types.rng, ran ant validate-xml, and it worked. Here are the changes and the new regex.
> <define name="its-allowedCharacters.type">
> <data type="string">
> <param name="pattern">((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\])*|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\])*)*|(\\w)*</param>
> </data>
> </define>
>
> It covers everything except nested character classes such as [a-d[^c]], which are not widely used and which most engines do not support. If we go on with the previous regex, I imagine we would have to drop these examples from the specification:
> "[ --[<>:"\\/|\?*]]" : allows only the characters valid for Windows file names.
> and
> "[a-ÿ-[\s]]" : allows all characters between U+0061 and U+00FF except the characters SPACE (U+0020), TABULATION (U+0009), CARRIAGE RETURN (U+000D) and LINE FEED (U+000A).
> Otherwise, here is a regex that covers everything, but it's huge:
> ((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|)+)+\])|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\]))+)+\])*|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|)+)+\])|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\]))+(-)([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|)+)+\])|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\]))+)+\])*)*|(\\w)*
>
> And one last thing: ranges such as [a-f-[z]] do not seem to be very valid, since [a-z] is the same and better.
>
> By the way, Jirka, when I tried to validate the files, the jing.jar of the Test Suite repository didn't work for me; I had to copy the one from your repository html5-its-tools. Can anyone confirm this?
>
> Cheers,
> Pablo.
> __________________________________
>
> On 06.04.13 22:45, Felix Sasaki wrote:
>> Hi Pablo, all,
>>
>> I had a look at the test suite again and found these kinds of regexes:
>>
>> [a-zA-Z_\-]
>> [^*+]
>> [ !–~]
>> [ -þ]
>> [^*+]
>>
>> Maybe it would help to do the ABNF approach that Pablo mentioned
> Oops, sorry, I meant "that Jirka mentioned".
>
> - Felix
>
>> and restrict us with that. See an ABNF below.
>>
>> ========
>> allowedCharacters = start 1*range end ["+"]
>>
>> start = "["
>>
>> end = "]"
>>
>> range = char / char "-" char
>>
>> char = [neg] BMP+escapes
>>
>> neg = "^"
>>
>> ========
>>
>> This means: the regex must always start with "[" and end with "]". In
>> the brackets there must be at least one range. The range can be just
>> one or more characters or a range in the form of character "-" character.
>> The character is "char" which optionally can be forbidden via "^".
>> BMP+escapes then is the Unicode BMP, including the escapes of
>> characters like "[", "]", "-" etc.
>>
>> This is more restricted than what Shaun proposed at
>> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2013Jan/0180.html
>>
>> but looking at the test suite and the use case of allowed characters
>> it seems to cover everything.
>>
>> Using the ABNF would not mean dropping the regex. I started working on an
>> XML Schema / RELAX NG regex implementing the above ABNF, and it looks
>> pretty straightforward.
>>
>> Thoughts?
>>
>> Best,
>>
>> Felix
>>
>>
>> On 06.04.13 19:18, Pablo Nieto Caride wrote:
>>> Hi Felix, all,
>>>
>>>
>>> On Apr 6, 2013, at 1:25 PM, Felix Sasaki <fsasaki@w3.org> wrote:
>>>
>>>> Hi Pablo,
>>>>
>>>> sorry for the effort, but to move this forward, we need to
>>>> make sure that at least the test suite regex examples work.
>>>>
>>>> I checked
>>>> https://github.com/finnle/ITS-2.0-Testsuite/tree/master/its2.0/inputdata/allowedcharacters/xml
>>>>
>>>> by replacing in my local copy of the test suite
>>>> https://github.com/finnle/ITS-2.0-Testsuite/blob/master/its2.0/schema/its20-types.rng
>>>>
>>>> this part
>>>> <define name="its-allowedCharacters.type">
>>>> <data type="string"></data>
>>>> </define>
>>>> with this, that is, with your regex for XML validation inside the
>>>> "pattern" element:
>>>> <data type="string">
>>>> <param name="pattern">^((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)?([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)*\]))*$</param>
>>>> </data>
>>>>
>>>> That gave me validation errors like this one:
>>>>
>>>> [jing]
>>>> /its2.0/inputdata/allowedcharacters/xml/allowedcharacters7xmlrules.xml:3:100:
>>>> error: Bad value "[^*+]" for attribute "allowedCharacters" on
>>>> element "allowedCharactersRule" from namespace
>>>> "http://www.w3.org/2005/11/its".
>>>>
>>>> Could you change in your local copy of the test suite the "param"
>>>> element with your regex so that the validation for all test suite
>>>> files for allowed characters
>>>> https://github.com/finnle/ITS-2.0-Testsuite/tree/master/its2.0/inputdata/allowedcharacters/
>>>>
>>>> works?
>>>>
>>>> FYI, the content is an XML Schema regular expression, so your XML
>>>> version for validation should work in the end, I think.
>>>>
>>>> Again, sorry for the effort, but it would be great to have this done
>>>> before the next publication, that is by Thursday next week. Would
>>>> that work for you?
>>>>
>>> I am doing some testing with the files you sent me to see how XSD works
>>> with regex and I'm seeing weird things, like problems with ^ and $ to
>>> set the beginning and end of the regex. I'm still working on it.
>>>
>>> I will do as you say and change my local copy of the schema to
>>> validate the Test Suite files.
>>>
>>> Yes, I think it'll work for me; there is time and I think I'm close to
>>> the solution. Sorry, but it got more complicated than I initially
>>> expected.
>>>
>>> Cheers,
>>> Pablo.
>>>
>>>> Best,
>>>>
>>>> Felix
>>>>
>>>> On 05.04.13 15:39, Pablo Nieto Caride wrote:
>>>>> Hi Felix,
>>>>>
>>>>> Yes, I tried the Allowed Characters Test-Suite example beforehand to
>>>>> make sure that the regex worked, and [a-zA-Z_\-] works for me on my
>>>>> system. Anyway, I'll try what you suggest and get back to you as
>>>>> soon as I have the results.
>>>>>
>>>>> Cheers,
>>>>> Pablo.
>>>>> __________________________________
>>>>>
>>>>> Hi Pablo, all,
>>>>>
>>>>> On 05.04.13 11:24, Pablo Nieto Caride wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> I have completed the regex. In the end I decided to restrict it to
>>>>>> Plane 0 (Basic Multilingual Plane, 0000-FFFF) because I think it is
>>>>>> sufficient and otherwise the regex would be very complex; besides,
>>>>>> Shaun didn't actually limit it to Plane 1 (Supplementary
>>>>>> Multilingual Plane, 10000–1FFFF) but to Planes 15-16 (10FFFF),
>>>>>> which is too much. I understand it covers the basics (now escapes
>>>>>> of [, ], ^ and -) and does not match incorrect regexes, such as
>>>>>> "[f-", supports the greedy and lazy wildcard (this is not really
>>>>>> necessary), and does not support nested character classes (do we
>>>>>> need them? They are rarely used in general). Please test it:
>>>>>> 1) Here is the proposed regular expression escaped with XML
>>>>>> numeric character entities, as if it were put into an XML document:
>>>>>> ^((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)?([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)*\]))*$
>>>>> I tried that with [a-zA-Z_\-]
>>>>> but got a validation error. Could you check a few examples from
>>>>> https://github.com/finnle/ITS-2.0-Testsuite/blob/master/its2.0/inputdata/allowedcharacters/html/
>>>>> to make sure that the regex works?
>>>>> E.g. by creating a schema like the attached one and checking with the
>>>>> regex?
>>>>>
>>>>>
>>>>> Best,
>>>>>
>>>>> Felix
>>>>>> 2) Here it is with \x{}, for Perl/PCRE only:
>>>>>> ^((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([\x{09}\x{0A}\x{0D}\x{20}-\x{2C}\x{2E}-\x{5A}\x{5F}-\x{D7FF}\x{E000}-\x{FFFD}]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)?([\x{09}\x{0A}\x{0D}\x{20}-\x{2C}\x{2E}-\x{5A}\x{5F}-\x{D7FF}\x{E000}-\x{FFFD}]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)*\]))*$
>>>>>>
>>>>>>
>>>>>> 3) And here is a regular expression that matches a subset of our
>>>>>> subset, limited to Plane 0, with the \u escape:
>>>>>> ^((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([\u0009\u000A\u000D\u0020-\u002C\u002E-\u005A\u005F-\uD7FF\uE000-\uFFFD]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)?([\u0009\u000A\u000D\u0020-\u002C\u002E-\u005A\u005F-\uD7FF\uE000-\uFFFD]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)*\]))*$
>>>>>>
>>>>>>
>>>>>> 4) And remember, the backslashes and escaped backslashes are
>>>>>> significant to the regular expression engine. If you're putting
>>>>>> that into a string in a language like Java or C#, you need to
>>>>>> escape the escapes:
>>>>>> re = new Regex("^((\\.((\\*|\\+)|(\\*\\?|\\+\\?))?)|(\\[\\^?(([\\u0009\\u000A\\u000D\\u0020-\\u002C\\u002E-\\u005A\\u005F-\\uD7FF\\uE000-\\uFFFD]|(\\\\\\[)|(\\\\\\])|(\\\\\\^)|(\\\\\\-)|(\\\\))+(-)?([\\u0009\\u000A\\u000D\\u0020-\\u002C\\u002E-\\u005A\\u005F-\\uD7FF\\uE000-\\uFFFD]|(\\\\\\[)|(\\\\\\])|(\\\\\\^)|(\\\\\\-)|(\\\\))+)*\\]))*$");
>>>>>>
>>>>>> I'll proceed now to draft text explaining the importance of Unicode
>>>>>> normalization and best practices; that's Action-430.
>>>>>>
>>>>>> Cheers,
>>>>>> Pablo.
>>>>>> __________________________________
>>>>>>
>>>>>> Hi Jirka,
>>>>>>
>>>>>> It should not match invalid expressions since it only supports
>>>>>> character classes, ranges and negations, but it still needs a bit of
>>>>>> polishing regarding escapes. I don't think we need a BNF grammar,
>>>>>> but it's not mine to decide; I'm just doing what I'm supposed to.
>>>>>>
>>>>>> Cheers,
>>>>>> Pablo.
>>>>>> __________________________________
>>>>>>
>>>>>> On 4.4.2013 17:12, Pablo Nieto Caride wrote:
>>>>>>> Please, implementers and whoever is interested, give
>>>>>>> feedback if necessary so I can move forward and evolve the regex.
>>>>>> Hi,
>>>>>>
>>>>>> since such complex regular expressions are mostly write-only (it's
>>>>>> very hard to understand what they are trying to match) I'm not
>>>>>> sure what's the point of having this complex regular expression
>>>>>> for checking our regular expression syntax subset. I haven't tried
>>>>>> to get a deep understanding of this expression but I bet it will
>>>>>> match even invalid expressions. If we want to have a rigorous
>>>>>> definition of our RE syntax we should provide its definition as a
>>>>>> grammar written in BNF.
>>>>>>
>>>>>> Jirka
>>>>>>
>>>>>> --
>>>>>> ------------------------------------------------------------------
>>>>>> Jirka Kosek e-mail: jirka@kosek.cz http://xmlguru.cz
>>>>>> ------------------------------------------------------------------
>>>>>> Professional XML consulting and training services
>>>>>> DocBook customization, custom XSLT/XSL-FO document processing
>>>>>> ------------------------------------------------------------------
>>>>>> OASIS DocBook TC member, W3C Invited Expert, ISO JTC1/SC34 rep.
>>>>>> ------------------------------------------------------------------
>>>>>> Bringing you XML Prague conference http://xmlprague.cz
>>>>>> ------------------------------------------------------------------
Received on Monday, 8 April 2013 15:02:20 UTC
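
For readers who want to experiment with the ABNF that Felix sketches in the thread (allowedCharacters = start 1*range end ["+"]), here is a minimal Python sketch of a hand-written checker. It is only an illustration, not what was put into the schema. The function names are hypothetical, and two points are assumptions because the thread does not spell them out: the set of characters that must be backslash-escaped is taken to be "[", "]", "^", "-" and "\", and "BMP" is checked simply as a code point no higher than U+FFFF.

```python
# Assumption: these characters are only allowed when preceded by a backslash.
SPECIAL = set("[]^-\\")


def _read_char(s, i):
    """Consume one ABNF `char` ([neg] BMP+escapes) at index i.

    Returns the index just after the char, or None if no char starts at i.
    """
    if i < len(s) and s[i] == "^":  # optional `neg`
        i += 1
    if i >= len(s):
        return None
    if s[i] == "\\":  # escape: backslash followed by a special character
        if i + 1 < len(s) and s[i + 1] in SPECIAL:
            return i + 2
        return None
    if s[i] in SPECIAL or ord(s[i]) > 0xFFFF:  # unescaped special or non-BMP
        return None
    return i + 1


def is_allowed_characters(s):
    """Return True if s matches: start 1*range end ["+"]."""
    if not s.startswith("["):  # start = "["
        return False
    i, ranges = 1, 0
    while True:
        j = _read_char(s, i)
        if j is None:
            break
        if j < len(s) and s[j] == "-":  # range = char "-" char
            j = _read_char(s, j + 1)
            if j is None:
                return False  # dangling "-", as in "[f-"
        i = j
        ranges += 1
    if ranges == 0 or i >= len(s) or s[i] != "]":  # 1*range, then end = "]"
        return False
    return s[i + 1:] in ("", "+")  # optional trailing "+"


if __name__ == "__main__":
    # Examples taken from the thread; ".*" is what the ABNF excludes.
    for candidate in ["[a-zA-Z_\\-]", "[^*+]", "[ -þ]", "[f-", ".*"]:
        print(repr(candidate), is_allowed_characters(candidate))
```

A checker written this way accepts the test-suite examples quoted above and rejects both the broken "[f-" and anything not starting with "[", which is the difference from Pablo's regex that Felix points out.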