- From: Pablo Nieto Caride <pablo.nieto@linguaserve.com>
- Date: Mon, 8 Apr 2013 12:45:01 +0200
- To: "'Felix Sasaki'" <fsasaki@w3.org>, "'Jirka Kosek'" <jirka@kosek.cz>
- Cc: <public-multilingualweb-lt@w3.org>
Hi Felix, all,

The ABNF does not seem to be a bad approach. In any case, I have reworked
the regex (the markers ^ and $ at the beginning and the end do not seem to
work with XSD) and now it is OK. I did what Felix suggested: I changed my
its20-types.rng, ran ant validate-xml, and it worked. Here are the changes
and the new regex.

<define name="its-allowedCharacters.type">
  <data type="string">
    <param name="pattern">((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([&#x9;&#xA;&#xD;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\])*|(\[\^?(([&#x9;&#xA;&#xD;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)([&#x9;&#xA;&#xD;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\])*)*|(\\w)*</param>
  </data>
</define>

It covers everything except for nested character classes such as
[a-d[^c]], which are not widely used and which most engines do not
support. If we go on with the regex above, I imagine we would have to
drop the following examples from the specification:

"[ --[<>:"\\/|\?*]]" : allows only the characters valid for Windows
file names.

"[a-ÿ-[\s]]" : allows all characters between U+0061 and U+00FF except
the characters SPACE (U+0020), TABULATION (U+0009), CARRIAGE RETURN
(U+000D) and LINE FEED (U+000A).

Otherwise, here is a regex that covers everything, but it is huge:

((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([&#x9;&#xA;&#xD;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|(\[\^?(([&#x9;&#xA;&#xD;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|)+)+\])|(\[\^?(([&#x9;&#xA;&#xD;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)([&#x9;&#xA;&#xD;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\]))+)+\])*|(\[\^?(([&#x9;&#xA;&#xD;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|(\[\^?(([&#x9;&#xA;&#xD;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|)+)+\])|(\[\^?(([&#x9;&#xA;&#xD;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)([&#x9;&#xA;&#xD;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\]))+(-)([&#x9;&#xA;&#xD;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|(\[\^?(([&#x9;&#xA;&#xD;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|)+)+\])|(\[\^?(([&#x9;&#xA;&#xD;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)([&#x9;&#xA;&#xD;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\]))+)+\])*)*|(\\w)*

And one last thing: ranges such as [a-f-[z]] do not seem very valid,
since [a-z] is the same and better.

By the way, Jirka, when I tried to validate the files, the jing.jar of
the Test Suite repository did not work for me; I had to copy the one
from your repository html5-its-tools. Can anyone confirm this?

Cheers,
Pablo.
__________________________________

On 06.04.13 22:45, Felix Sasaki wrote:
> Hi Pablo, all,
>
> I had a look at the test suite again and found these kinds of regexes:
>
> [a-zA-Z_\-]
> [^*+]
> [ !–~]
> [ -þ]
> [^*+]
>
> Maybe it would help to do the ABNF approach that Pablo mentioned

Oops, sorry, I meant "that Jirka mentioned".

- Felix

> and restrict us with that. See an ABNF below.
>
> ========
> allowedCharacters = start 1*range end ["+"]
>
> start = "["
>
> end = "]"
>
> range = char / char "-" char
>
> char = [neg] BMP+escapes
>
> neg = "^"
>
> ========
>
> This means: the regex must always start with "[" and end with "]". In
> the brackets there must be at least one range. The range can be just
> one or more characters or a range in the form of character "-"
> character. The character is "char", which optionally can be forbidden
> via "^". BMP+escapes then is the Unicode BMP, including the escapes of
> characters like "[", "]", "-" etc.
>
> This is more restricted than what Shaun proposed at
> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2013Jan/0180.html
>
> but looking at the test suite and the use case of allowed characters
> it seems to cover everything.
>
> Using the ABNF would not mean dropping the regex. I started working on
> an XML Schema / RELAX NG regex implementing the above ABNF, and it
> looks pretty straightforward.
>
> Thoughts?
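
As a rough illustration of the ABNF above, here is a minimal Python sketch of a checker. It is not the XML Schema / RELAX NG regex mentioned in the thread. The names ESCAPE, LITERAL, CHAR, RANGE, ALLOWED_CHARACTERS and conforms are made up for this example, and the reading of "BMP+escapes" (any BMP character except an unescaped "[", "]", "^", "-" or "\", plus backslash escapes of those five) is an assumption.

```python
import re

# Assumption: the escapable characters are "[", "]", "^", "-" and "\".
ESCAPE = r"\\[\[\]\^\-\\]"
# Assumption: any other BMP character may appear unescaped.
LITERAL = r"[^\[\]\^\-\\\U00010000-\U0010FFFF]"
# char = [neg] BMP+escapes
CHAR = rf"\^?(?:{ESCAPE}|{LITERAL})"
# range = char / char "-" char
RANGE = rf"{CHAR}(?:-{CHAR})?"
# allowedCharacters = start 1*range end ["+"]
ALLOWED_CHARACTERS = re.compile(rf"\[(?:{RANGE})+\]\+?")


def conforms(value: str) -> bool:
    """Return True if the whole value fits the ABNF as read above."""
    return ALLOWED_CHARACTERS.fullmatch(value) is not None


# Test-suite style values quoted in the thread.
for sample in ["[a-zA-Z_\\-]", "[^*+]", "[ !–~]", "[ -þ]", "[f-", "[a-d[^c]]"]:
    print(repr(sample), conforms(sample))
```

Because the ABNF attaches "neg" to every char, this sketch also accepts a "^" in the middle of the brackets; whether that is intended would have to be decided in the grammar itself.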
>
> Best,
>
> Felix
>
>
> On 06.04.13 19:18, Pablo Nieto Caride wrote:
>> Hi Felix, all,
>>
>>
>> On Apr 6, 2013, at 1:25 PM, Felix Sasaki <fsasaki@w3.org> wrote:
>>
>>> Hi Pablo,
>>>
>>> sorry for the effort, but to move this forward, we need to make
>>> sure that at least the test suite regex examples work.
>>>
>>> I checked
>>> https://github.com/finnle/ITS-2.0-Testsuite/tree/master/its2.0/inputdata/allowedcharacters/xml
>>>
>>> by replacing in my local copy of the test suite
>>> https://github.com/finnle/ITS-2.0-Testsuite/blob/master/its2.0/schema/its20-types.rng
>>>
>>> this part
>>> <define name="its-allowedCharacters.type">
>>> <data type="string"></data>
>>> </define>
>>> with this, that is, with your regex for XML validation inside the
>>> "pattern" element:
>>> <data type="string">
>>> <param name="pattern">^((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([&#x9;&#xA;&#xD;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)?([&#x9;&#xA;&#xD;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)*\]))*$</param>
>>> </data>
>>>
>>> That gave me validation errors like this one:
>>>
>>> [jing] /its2.0/inputdata/allowedcharacters/xml/allowedcharacters7xmlrules.xml:3:100:
>>> error: Bad value "[^*+]" for attribute "allowedCharacters" on
>>> element "allowedCharactersRule" from namespace
>>> "http://www.w3.org/2005/11/its".
>>>
>>> Could you change the "param" element in your local copy of the test
>>> suite to use your regex, so that the validation works for all test
>>> suite files for allowed characters
>>> https://github.com/finnle/ITS-2.0-Testsuite/tree/master/its2.0/inputdata/allowedcharacters/
>>> ?
>>>
>>> FYI, the content is an XML Schema regular expression, so your XML
>>> version for validation should work in the end, I think.
>>>
>>> Again, sorry for the effort, but it would be great to have this done
>>> before the next publication, that is, by Thursday next week. Would
>>> that work for you?
>>>
>> I am doing some testing with the files you sent me to see how XSD
>> works with regex, and I'm seeing weird things, like problems with ^
>> and $ to set the beginning and end of the regex. I'm still working
>> on it.
>>
>> I will do as you say and change my local copy of the schema to
>> validate the Test Suite files.
>>
>> Yes, I think it'll work for me; there is time and I think I'm close
>> to the solution. Sorry, but it got more complicated than I initially
>> expected.
>>
>> Cheers,
>> Pablo.
>>
>>> Best,
>>>
>>> Felix
>>>
>>> On 05.04.13 15:39, Pablo Nieto Caride wrote:
>>>> Hi Felix,
>>>>
>>>> Yes, I tried the Allowed Characters Test Suite example before to
>>>> make sure that the regex worked, and [a-zA-Z_\-] works for me on my
>>>> system. Anyway, I'll try what you suggest and get back to you as
>>>> soon as I have the results.
>>>>
>>>> Cheers,
>>>> Pablo.
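
One possible explanation for the ^ and $ problems reported above: in XML Schema regular expressions, as far as the Datatypes specification reads, a pattern is always matched against the whole value, and ^ and $ outside a character class are ordinary characters rather than anchors. The Python sketch below only imitates that behaviour for simple patterns. The helpers escape_anchors and xsd_like_match are hypothetical, and Python's re differs from XSD regexes in other ways (for example \i, \c and character class subtraction), so this is an approximation, not a validator.

```python
import re


def escape_anchors(pattern: str) -> str:
    """Turn ^ and $ outside character classes into literal characters."""
    out = []
    in_class = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            out.append(pattern[i:i + 2])  # keep escape pairs untouched
            i += 2
            continue
        if ch == "[":
            in_class = True
        elif ch == "]":
            in_class = False
        elif ch in "^$" and not in_class:
            ch = "\\" + ch  # literal character, not an anchor
        out.append(ch)
        i += 1
    return "".join(out)


def xsd_like_match(pattern: str, value: str) -> bool:
    """Whole-value match, with ^ and $ treated as literals (assumption)."""
    return re.fullmatch(escape_anchors(pattern), value) is not None


print(xsd_like_match("[a-z]+", "abc"))       # no anchors needed
print(xsd_like_match("^[a-z]+$", "abc"))     # ^ and $ now demand literal characters
print(xsd_like_match("^[a-z]+$", "^abc$"))
```

Under this reading, a pattern wrapped in ^...$ can only match values that literally start with "^" and end with "$", which would be consistent with the validation error on "[^*+]" and with the later regex that drops both markers.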
>>>> __________________________________
>>>>
>>>> Hi Pablo, all,
>>>>
>>>> On 05.04.13 11:24, Pablo Nieto Caride wrote:
>>>>> Hi all,
>>>>>
>>>>> I have completed the regex. Finally I decided to restrict it to
>>>>> Plane 0 (Basic Multilingual Plane 0000-FFFF) because I think it is
>>>>> sufficient and otherwise the regex would be very complex; besides,
>>>>> Shaun didn't actually limit it to Plane 1 (Supplementary
>>>>> Multilingual Plane 10000–1FFFF) but to Planes 15-16 (10FFFF),
>>>>> which is too much. I understand it covers the basics (now escapes
>>>>> of [, ], ^ and -) and does not match incorrect regex, such as
>>>>> "[f-", supports the greedy and lazy wildcard (this is not really
>>>>> necessary), and does not support nested character classes (do we
>>>>> need them? They are rarely used in general). Please test it:
>>>>> 1) Here is the proposed regular expression escaped with XML
>>>>> numeric character entities, as if it were put into an XML document:
>>>>> ^((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([&#x9;&#xA;&#xD;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)?([&#x9;&#xA;&#xD;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)*\]))*$
>>>> I tried that with an [a-zA-Z_\-]
>>>> but got a validation error. Could you check a few examples from
>>>> https://github.com/finnle/ITS-2.0-Testsuite/blob/master/its2.0/inputdata/allowedcharacters/html/
>>>> to make sure that the regex works? E.g. by creating a schema like
>>>> the attached one and checking with the regex?
>>>>
>>>>
>>>> Best,
>>>>
>>>> Felix
>>>>> 2) Here it is with \x{}, for Perl/PCRE only:
>>>>> ^((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([\x{09}\x{0A}\x{0D}\x{20}-\x{2C}\x{2E}-\x{5A}\x{5F}-\x{D7FF}\x{E000}-\x{FFFD}]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)?([\x{09}\x{0A}\x{0D}\x{20}-\x{2C}\x{2E}-\x{5A}\x{5F}-\x{D7FF}\x{E000}-\x{FFFD}]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)*\]))*$
>>>>>
>>>>>
>>>>> 3) And here is a regular expression that matches a subset of our
>>>>> subset, limited to Plane 0, with the \u escape:
>>>>> ^((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([\u0009\u000A\u000D\u0020-\u002C\u002E-\u005A\u005F-\uD7FF\uE000-\uFFFD]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)?([\u0009\u000A\u000D\u0020-\u002C\u002E-\u005A\u005F-\uD7FF\uE000-\uFFFD]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)*\]))*$
>>>>>
>>>>>
>>>>> 4) And remember, the backslashes and escaped backslashes are
>>>>> significant to the regular expression engine. If you're putting
>>>>> that into a string in a language like Java or C#, you need to
>>>>> escape the escapes:
>>>>> re = new Regex("^((\\.((\\*|\\+)|(\\*\\?|\\+\\?))?)|(\\[\\^?(([\\u0009\\u000A\\u000D\\u0020-\\u002C\\u002E-\\u005A\\u005F-\\uD7FF\\uE000-\\uFFFD]|(\\\\\\[)|(\\\\\\])|(\\\\\\^)|(\\\\\\-)|(\\\\))+(-)?([\\u0009\\u000A\\u000D\\u0020-\\u002C\\u002E-\\u005A\\u005F-\\uD7FF\\uE000-\\uFFFD]|(\\\\\\[)|(\\\\\\])|(\\\\\\^)|(\\\\\\-)|(\\\\))+)*\\]))*$");
>>>>>
>>>>> I'll proceed now to draft text explaining the importance of
>>>>> Unicode normalization and best practices; that's Action-430.
>>>>>
>>>>> Cheers,
>>>>> Pablo.
>>>>> __________________________________
>>>>>
>>>>> Hi Jirka,
>>>>>
>>>>> It should not match invalid expressions since it only supports
>>>>> character classes, ranges and negations, but it still needs a bit
>>>>> of polishing regarding escapes. I don't think we need a BNF
>>>>> grammar, but it's not mine to decide; I am just doing what I'm
>>>>> supposed to.
>>>>>
>>>>> Cheers,
>>>>> Pablo.
>>>>> __________________________________
>>>>>
>>>>> On 4.4.2013 17:12, Pablo Nieto Caride wrote:
>>>>>> Please, implementers and whoever is interested, give
>>>>>> feedback if necessary so I can move forward and evolve the regex.
>>>>> Hi,
>>>>>
>>>>> since such complex regular expressions are mostly write-only (it's
>>>>> very hard to understand what they are trying to match), I'm not
>>>>> sure what the point is of having this complex regular expression
>>>>> for checking our regular expression syntax subset. I haven't tried
>>>>> to get a deep understanding of this expression, but I bet it will
>>>>> match even invalid expressions. If we want to have a rigorous
>>>>> definition of our RE syntax, we should provide its definition as a
>>>>> grammar written in BNF.
>>>>>
>>>>> Jirka
>>>>>
>>>>> --
>>>>> ------------------------------------------------------------------
>>>>> Jirka Kosek e-mail: jirka@kosek.cz http://xmlguru.cz
>>>>> ------------------------------------------------------------------
>>>>> Professional XML consulting and training services
>>>>> DocBook customization, custom XSLT/XSL-FO document processing
>>>>> ------------------------------------------------------------------
>>>>> OASIS DocBook TC member, W3C Invited Expert, ISO JTC1/SC34 rep.
>>>>> ------------------------------------------------------------------
>>>>> Bringing you XML Prague conference http://xmlprague.cz
>>>>> ------------------------------------------------------------------
Received on Monday, 8 April 2013 10:45:38 UTC