- From: Shaun McCance <shaunm@gnome.org>
- Date: Mon, 04 Feb 2013 16:28:45 -0500
- To: public-multilingualweb-lt@w3.org
On Mon, 2013-02-04 at 13:46 -0700, Yves Savourel wrote:
> Hi Shaun,
>
> Many thanks for this Shaun.
>
> I've added it to our ITS processing to check the its:allowedCharacters
> value and noticed that some of the test files have the expression "[^*
> +]" which seems to be not valid based on this checking expression. (I
> still have to make sure my validation code is right).
>
> Is that the case? If yes, how would we express "any chars but '*' and
> '+'"?

My mistake. It seems "^" maintains special meaning even when not at the beginning of the expression or a character class, so we have to escape it.

^(\.|\[\^?-?(([&#x9;&#xA;&#xD;&#x20;-&#x2C;&#x2E;-&#x5B;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;&#x10000;-&#x10FFFF;]|\\n|\\r|\\t|\\]|\\^|\\-|\\\\)(-([&#x9;&#xA;&#xD;&#x20;-&#x2C;&#x2E;-&#x5B;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;&#x10000;-&#x10FFFF;]|\\n|\\r|\\t|\\]|\\^|\\-|\\\\))?)+-?\])?$

^(\.|\[\^?-?(([\x{09}\x{0A}\x{0D}\x{20}-\x{2C}\x{2E}-\x{5B}\x{5F}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]|\\n|\\r|\\t|\\]|\\^|\\-|\\\\)(-([\x{09}\x{0A}\x{0D}\x{20}-\x{2C}\x{2E}-\x{5B}\x{5F}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]|\\n|\\r|\\t|\\]|\\^|\\-|\\\\))?)+-?\])?$

^(\.|\[\^?-?(([\u0009\u000A\u000D\u0020-\u002C\u002E-\u005B\u005F-\uD7FF\uE000-\uFFFD]|\\n|\\r|\\t|\\]|\\^|\\-|\\\\)(-([\u0009\u000A\u000D\u0020-\u002C\u002E-\u005B\u005F-\uD7FF\uE000-\uFFFD]|\\n|\\r|\\t|\\]|\\^|\\-|\\\\))?)+-?\])?$

re = new Regex("^(\\.|\\[\\^?-?(([\\u0009\\u000A\\u000D\\u0020-\\u002C\\u002E-\\u005B\\u005F-\\uD7FF\\uE000-\\uFFFD]|\\\\n|\\\\r|\\\\t|\\\\]|\\\\^|\\\\-|\\\\\\\\)(-([\\u0009\\u000A\\u000D\\u0020-\\u002C\\u002E-\\u005B\\u005F-\\uD7FF\\uE000-\\uFFFD]|\\\\n|\\\\r|\\\\t|\\\\]|\\\\^|\\\\-|\\\\\\\\))?)+-?\\])?$");

> cheers,
> -yves
>
> -----Original Message-----
> From: Shaun McCance [mailto:shaunm@gnome.org]
> Sent: Monday, February 04, 2013 11:53 AM
> To: public-multilingualweb-lt@w3.org
> Subject: Re: [ACTION-385] Common regular expression syntax
>
> On Sun, 2013-01-27 at 12:30 -0500, Shaun McCance wrote:
> > So what I think this leaves us with is character classes [abc], ranges
> > [a-c], and negations [^abc], where "^" and "]" must never appear
> > unless backslash-escaped, "-" may be backslash-escaped or put at the
> > beginning or end, the escape sequences "\n", "\r", "\t", "\d", and
> > "\D" may be used, and literal "\" is escaped as "\\".
> >
> > Importantly, you must never have an unescaped backslash, because some
> > dialects may treat it as the beginning of an escape sequence that
> > means something special.
> >
> > This is a very limited subset, but I think it's what we have to use.
> > I'm now going to try to make a portable RE that matches these portable
> > RE character classes.
>
> Upon further investigation, it seems some engines allow Unicode characters outside 0-9 for \d, so that's out too. There's an open question of what characters can be referred to. I decided to use the definition of Char in XML 1.0:
>
> http://www.w3.org/TR/REC-xml/#charsets
>
> It's hard to reference these, because many of the range boundary characters are unassigned, so effectively unprintable. I think we don't want to embed the literal character U+D7FF in the spec.
>
> Here is the proposed regular expression escaped with XML numeric character entities, as if it were put into an XML document:
>
> ^(\.|\[^?-?(([&#x9;&#xA;&#xD;&#x20;-&#x2C;&#x2E;-&#x5B;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;&#x10000;-&#x10FFFF;]|\\n|\\r|\\t|\\]|\\^|\\-|\\\\)(-([&#x9;&#xA;&#xD;&#x20;-&#x2C;&#x2E;-&#x5B;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;&#x10000;-&#x10FFFF;]|\\n|\\r|\\t|\\]|\\^|\\-|\\\\))?)+-?\])?$
>
> (Email will almost certainly add line breaks. Ignore them.)
>
> There are two ways I know of to escape characters (not bytes) in different engines: \x{2234} and \u2234. The \u syntax can only reference characters in the Basic Multilingual Plane (Plane 0), and works in everything except XSD and Perl/PCRE. The \x{} syntax is only Perl/PCRE, but can specify any character.
>
> Here it is with \x{}, for Perl/PCRE only:
>
> ^(\.|\[^?-?(([\x{09}\x{0A}\x{0D}\x{20}-\x{2C}\x{2E}-\x{5B}\x{5F}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]|\\n|\\r|\\t|\\]|\\^|\\-|\\\\)(-([\x{09}\x{0A}\x{0D}\x{20}-\x{2C}\x{2E}-\x{5B}\x{5F}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]|\\n|\\r|\\t|\\]|\\^|\\-|\\\\))?)+-?\])?$
>
> And here is a regular expression that matches a subset of our subset, limited to the Basic Multilingual Plane, with the \u escape:
>
> ^(\.|\[^?-?(([\u0009\u000A\u000D\u0020-\u002C\u002E-\u005B\u005F-\uD7FF\uE000-\uFFFD]|\\n|\\r|\\t|\\]|\\^|\\-|\\\\)(-([\u0009\u000A\u000D\u0020-\u002C\u002E-\u005B\u005F-\uD7FF\uE000-\uFFFD]|\\n|\\r|\\t|\\]|\\^|\\-|\\\\))?)+-?\])?$
>
> And remember, the backslashes and escaped backslashes are significant to the regular expression engine. If you're putting that into a string in a language like Java or C#, you need to escape the escapes:
>
> re = new Regex("^(\\.|\\[^?-?(([\\u0009\\u000A\\u000D\\u0020-\\u002C\\u002E-\\u005B\\u005F-\\uD7FF\\uE000-\\uFFFD]|\\\\n|\\\\r|\\\\t|\\\\]|\\\\^|\\\\-|\\\\\\\\)(-([\\u0009\\u000A\\u000D\\u0020-\\u002C\\u002E-\\u005B\\u005F-\\uD7FF\\uE000-\\uFFFD]|\\\\n|\\\\r|\\\\t|\\\\]|\\\\^|\\\\-|\\\\\\\\))?)+-?\\])?$");
>
> --
> Shaun
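
As a rough illustration of how a processor might apply the corrected \u form of the checking expression to an its:allowedCharacters value, here is a minimal Python sketch. The names CHARS, ESCAPES, ITEM, PORTABLE_CLASS and is_portable_class are made up for this example and are not part of ITS or of any library. The expression is assembled from the pieces given in the message and, being the \u variant, only covers the Basic Multilingual Plane.

```python
import re

# Characters allowed literally inside a class, as in the message:
# XML 1.0 Char minus "-", "\", "]" and "^", written with \uXXXX
# escapes and therefore limited to the Basic Multilingual Plane.
CHARS = (r"[\u0009\u000A\u000D\u0020-\u002C\u002E-\u005B"
         r"\u005F-\uD7FF\uE000-\uFFFD]")

# Two-character escape sequences, copied unchanged from the message.
ESCAPES = r"\\n|\\r|\\t|\\]|\\^|\\-|\\\\"

# One class member: a literal character or an escape sequence.
ITEM = "(" + CHARS + "|" + ESCAPES + ")"

# Raw strings keep every backslash as written, so the "escape the
# escapes" step needed for Java or C# string literals is not needed.
PORTABLE_CLASS = re.compile(
    r"^(\.|\[\^?-?(" + ITEM + "(-" + ITEM + r")?)+-?\])?$"
)


def is_portable_class(value: str) -> bool:
    """Return True if value matches the checking expression."""
    # fullmatch avoids "$" also matching before a trailing newline.
    return PORTABLE_CLASS.fullmatch(value) is not None


if __name__ == "__main__":
    for value in ["[^*+]", "[a-c]", "[a^b]", r"[a\d]"]:
        print(value, is_portable_class(value))
```

With this sketch, "[^*+]" and "[a-c]" are accepted, while "[a^b]" (unescaped "^" in the middle) and "[a\d]" (an escape outside the subset) are rejected. One caveat specific to this sketch: Python reads the `\\^` alternative as a literal backslash followed by a "^" anchor, so a value that contains the escape "\^" is still rejected here; writing that alternative as `\\\^` would probably be needed in Python.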
Received on Monday, 4 February 2013 21:29:08 UTC