RE: [Issue-67] [Action-385] Work on regex for validating regex subset proposal from Pablo Nieto Caride on 2013-04-05 (public-multilingualweb-lt@w3.org from April 2013)

From: Pablo Nieto Caride <pablo.nieto@linguaserve.com>
Date: Fri, 5 Apr 2013 11:24:06 +0200
To: "'Jirka Kosek'" <jirka@kosek.cz>
Cc: <public-multilingualweb-lt@w3.org>
Message-ID: <061501ce31df$520413b0$f60c3b10$@linguaserve.com>

Hi all,

I have completed the regex. In the end I decided to restrict it to Plane 0 (Basic Multilingual Plane, 0000-FFFF) because I think that is sufficient and the regex would otherwise be very complex. Besides, Shaun didn't actually limit his to Plane 1 (Supplementary Multilingual Plane, 10000–1FFFF) but went up to Planes 15-16 (10FFFF), which is too much. As I understand it, the regex covers the basics (now including escapes of [, ], ^ and -) and does not match incorrect regexes such as "[f-". It supports the greedy and lazy wildcards (this is not really necessary) and does not support nested character classes (do we need them? They are rarely used in general). Please test it:
1) Here is the proposed regular expression escaped with XML numeric character references, as if it were put into an XML document:
^((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)?([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)*\]))*$

2) Here it is with \x{}, for Perl/PCRE only:
^((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([\x{09}\x{0A}\x{0D}\x{20}-\x{2C}\x{2E}-\x{5A}\x{5F}-\x{D7FF}\x{E000}-\x{FFFD}]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)?([\x{09}\x{0A}\x{0D}\x{20}-\x{2C}\x{2E}-\x{5A}\x{5F}-\x{D7FF}\x{E000}-\x{FFFD}]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)*\]))*$

3) And here is a regular expression that matches a subset of our subset, limited to Plane 0, with the \u escape:
^((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([\u0009\u000A\u000D\u0020-\u002C\u002E-\u005A\u005F-\uD7FF\uE000-\uFFFD]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)?([\u0009\u000A\u000D\u0020-\u002C\u002E-\u005A\u005F-\uD7FF\uE000-\uFFFD]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)*\]))*$

4) And remember, the backslashes and escaped backslashes are significant to the regular expression engine. If you're putting that into a string in a language like Java or C#, you need to escape the escapes:
re = new Regex("^((\\.((\\*|\\+)|(\\*\\?|\\+\\?))?)|(\\[\\^?(([\\u0009\\u000A\\u000D\\u0020-\\u002C\\u002E-\\u005A\\u005F-\\uD7FF\\uE000-\\uFFFD]|(\\\\\\[)|(\\\\\\])|(\\\\\\^)|(\\\\\\-)|(\\\\))+(-)?([\\u0009\\u000A\\u000D\\u0020-\\u002C\\u002E-\\u005A\\u005F-\\uD7FF\\uE000-\\uFFFD]|(\\\\\\[)|(\\\\\\])|(\\\\\\^)|(\\\\\\-)|(\\\\))+)*\\]))*$");
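
For anyone who wants to try it quickly, here is a minimal sketch that runs variant 3 in Python. It assumes that Python's re module reads the \u escapes and the character ranges the same way as the target engine, which may not hold for every implementation. The names ATOM, SUBSET_RE and is_in_subset are made up for this example and are not part of the proposal.

```python
import re

# Variant 3 (\u escapes, Plane 0 only), copied unchanged from the message.
# ATOM is a hypothetical helper name, used only to avoid repeating the
# long "one class member" sub-pattern twice.
ATOM = (
    r"([\u0009\u000A\u000D\u0020-\u002C\u002E-\u005A\u005F-\uD7FF\uE000-\uFFFD]"
    r"|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+"
)
SUBSET_RE = re.compile(
    r"^((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(" + ATOM + r"(-)?" + ATOM + r")*\]))*$"
)


def is_in_subset(expr: str) -> bool:
    """Return True if expr is accepted by the proposed validation regex."""
    # fullmatch is used so that a trailing newline cannot slip past "$".
    return SUBSET_RE.fullmatch(expr) is not None


if __name__ == "__main__":
    samples = ["", ".", ".*", ".+?", "[abc]", "[a-z]", "[^a-zA-Z]",
               r"[\]\[]", "[f-", "abc", "[[a]]", "[a]", "[z-a]"]
    for sample in samples:
        print(repr(sample), is_in_subset(sample))
```

In this sketch "[f-" is rejected and "[a-z]" is accepted, as intended. Two results may be worth a look: a single-member class such as "[a]" is rejected, because each group inside the brackets needs at least two members, and a reversed range such as "[z-a]" is accepted, although Python's own re module refuses to compile it.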

I'll now proceed to draft text explaining the importance of Unicode normalization and best practices; that's Action-430.

Cheers,
Pablo.
__________________________________

Hi Jirka,

It should not match invalid expressions, since it only supports character classes, ranges and negations, but it still needs a bit of polishing regarding escapes. I don't think we need a BNF grammar, but it's not mine to decide; I'm just doing what I'm supposed to.

Cheers,
Pablo.
__________________________________

On 4.4.2013 17:12, Pablo Nieto Caride wrote:
> Please, implementers and anyone else who is interested, give feedback if 
> necessary so I can move forward and evolve the regex.

Hi,

Since such complex regular expressions are mostly write-only (it's very hard to understand what they are trying to match), I'm not sure what the point is of having this one for checking our regular expression syntax subset. I haven't tried to gain a deep understanding of this expression, but I bet it will match even invalid expressions. If we want a rigorous definition of our RE syntax, we should provide it as a grammar written in BNF.

     Jirka

--
------------------------------------------------------------------
  Jirka Kosek      e-mail: jirka@kosek.cz      http://xmlguru.cz
------------------------------------------------------------------
       Professional XML consulting and training services
  DocBook customization, custom XSLT/XSL-FO document processing
------------------------------------------------------------------
 OASIS DocBook TC member, W3C Invited Expert, ISO JTC1/SC34 rep.
------------------------------------------------------------------
    Bringing you XML Prague conference    http://xmlprague.cz
------------------------------------------------------------------
Received on Friday, 5 April 2013 09:24:42 UTC