- From: Pablo Nieto Caride <pablo.nieto@linguaserve.com>
- Date: Mon, 8 Apr 2013 12:45:01 +0200
- To: "'Felix Sasaki'" <fsasaki@w3.org>, "'Jirka Kosek'" <jirka@kosek.cz>
- Cc: <public-multilingualweb-lt@w3.org>
Hi Felix, all,
The ABNF does not seem to be a bad approach. In any case, I have reworked the regex (the markers ^ and $ at the beginning and the end do not seem to work with XSD) and now it's OK. I did what Felix suggested: I changed my its20-types.rng, ran ant validate-xml, and it worked. Here are the changes and the new regex.
<define name="its-allowedCharacters.type">
<data type="string">
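<param name="pattern">((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([&#x9;&#xA;&#xD;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\])*|(\[\^?(([&#x9;&#xA;&#xD;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)([&#x9;&#xA;&#xD;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\])*)* |(\\w)*</param>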
</data>
</define>
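On the ^ and $ problem mentioned above: XML Schema patterns are implicitly anchored, so the whole attribute value has to match and ^ and $ do not act as anchors there. A rough way to reproduce that behaviour outside a validator is Python's re.fullmatch. This is only a sketch: Python's regex dialect is not the XSD dialect, SIMPLE_CLASS is a simplified stand-in (not the regex above), and the helper names are made up for illustration.

```python
import re

# Simplified stand-in for one character class such as [a-z] or [^*+].
# Assumption: this is NOT the full regex from the message.
SIMPLE_CLASS = r"\[\^?[^\[\]]+\]"


def xsd_like_match(pattern: str, value: str) -> bool:
    """Approximate XSD matching: the pattern must cover the whole value."""
    return re.fullmatch(pattern, value) is not None


def literal_anchor_match(pattern: str, value: str) -> bool:
    """Mimic a processor that reads ^ and $ as ordinary characters."""
    return xsd_like_match(re.escape("^") + pattern + re.escape("$"), value)


if __name__ == "__main__":
    print(xsd_like_match(SIMPLE_CLASS, "[^*+]"))           # whole value matches
    print(xsd_like_match(SIMPLE_CLASS, "[^*+] trailing"))  # extra text is rejected
    print(literal_anchor_match(SIMPLE_CLASS, "[^*+]"))     # literal ^...$ required
```

With the anchors read literally, a value like [^*+] no longer matches, which is consistent with the validation errors reported earlier in the thread.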
It covers everything except for nested character classes such as [a-d[^c]], which are not widely used and which most engines do not support. If we go on with the previous regex, I imagine we would have to drop the following examples from the specification:
"[ --[<>:"\\/|\?*]]" : allows only the characters valid for Windows file names.
and
"[a-ÿ-[\s]]" : allows all characters between U+0061 and U+00FF except the characters SPACE (U+0020), TABULATION (U+0009), CARRIAGE RETURN (U+000D) and LINE FEED (U+000A).
Otherwise, here is a regex that covers everything, but it's huge:
((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([	

 -,.-Z_-퟿-�]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|(\[\^?(([	

 -,.-Z_-퟿-�]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|)+)+\])|(\[\^?(([	

 -,.-Z_-퟿-�]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)([	

 -,.-Z_-퟿-�]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\]))+)+\])*|(\[\^?(([	

 -,.-Z_-퟿-�]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|(\[\^?(([	

 -,.-Z_-퟿-�]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|)+)+\])|(\[\^?(([	

 -,.-Z_-퟿-�]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)([	

 -,.-Z_-퟿-�]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\]))+(-)([	

 -,.-Z_-퟿-�]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|(\[\^?(([	

 -,.-Z_-퟿-�]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\)|)+)+\])|(\[\^?(([	

 -,.-Z_-퟿-�]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)([	

 -,.-Z_-퟿-�]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)+\]))+)+\])*)*|(\\w)*
And one last thing: ranges such as [a-f-[z]] do not seem very valid, since [a-z] is the same and better.
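For readers whose engine lacks XSD-style character class subtraction ([base-[excluded]]), the same single-character set can usually be approximated with a negative lookahead. The sketch below uses Python's re module, which does not support the subtraction syntax; the helper name subtract is made up for illustration, and both arguments are assumed to be already valid character-class bodies.

```python
import re
import string


def subtract(base: str, excluded: str) -> re.Pattern:
    """Emulate [base-[excluded]] for one character via a negative lookahead."""
    return re.compile(rf"(?![{excluded}])[{base}]")


def members(pattern: re.Pattern, alphabet: str) -> str:
    """Return the characters of `alphabet` that the pattern accepts."""
    return "".join(ch for ch in alphabet if pattern.fullmatch(ch))


if __name__ == "__main__":
    consonants = subtract("a-z", "aeiou")
    print(members(consonants, string.ascii_lowercase))

    a_to_f_minus_z = subtract("a-f", "z")
    print(members(a_to_f_minus_z, string.ascii_lowercase))
```

Read as a subtraction, [a-f-[z]] selects the same characters as [a-f] in this sketch, because z is not part of the base range.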
By the way, Jirka, when I tried to validate the files, the jing.jar from the Test Suite repository didn't work for me; I had to copy the one from your html5-its-tools repository. Can anyone confirm this?
Cheers,
Pablo.
__________________________________
On 06.04.13 22:45, Felix Sasaki wrote:
> Hi Pablo, all,
>
> I had a look at the test suite again and found these kinds of regexes:
>
> [a-zA-Z_\-]
> [^*+]
> [ !–~]
> [ -þ]
> [^*+]
>
> Maybe it would help to do the ABNF approach that Pablo mentioned
Oops, sorry, I meant "that Jirka mentioned".
- Felix
> and restrict us with that. See an ABNF below.
>
> ========
> allowedCharacters = start 1*range end ["+"]
>
> start = "["
>
> end = "]"
>
> range = char / char "-" char
>
> char = [neg] BMP+escapes
>
> neg = "^"
>
> ========
>
> This means: the regex must always start with "[" and end with "]". In
> the brackets there must be at least one range. The range can be just
> one or more characters or a range in the form of character "-" character.
> The character is "char" which optionally can be forbidden via "^".
> BMP+escapes then is the Unicode BMP, including the escapes of
> characters like "[", "]", "-" etc.
>
> This is more restricted than what Shaun proposed at
> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2013Jan/0180.html
>
> but looking at the test suite and the use case of allowed characters
> it seems to cover everything.
>
> Using the ABNF would not mean dropping the regex. I started working on
> an XML Schema / RELAX NG regex implementing the above ABNF, and it looks
> pretty straightforward.
>
> Thoughts?
>
> Best,
>
> Felix
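The ABNF in Felix's message can be turned into a small checker to try against the test suite values he lists. This is a sketch under assumptions, not the schema regex Felix mentions: "BMP+escapes" is read here as "any BMP character except an unescaped [ ] ^ - \, or one of those characters preceded by a backslash", and the names BMP_CHAR and conforms are made up for illustration.

```python
import re

# BMP+escapes (assumed reading): an escaped metacharacter, or any BMP
# character that is not one of the metacharacters [ ] ^ - \
BMP_CHAR = r"(?:\\[\[\]\^\-\\]|[^\[\]\^\-\\\U00010000-\U0010FFFF])"

# char = [neg] BMP+escapes
CHAR = rf"\^?{BMP_CHAR}"

# range = char / char "-" char  (the longer alternative is tried first)
RANGE = rf"(?:{CHAR}-{CHAR}|{CHAR})"

# allowedCharacters = start 1*range end ["+"]
ALLOWED_CHARACTERS = re.compile(rf"\[(?:{RANGE})+\]\+?")


def conforms(value: str) -> bool:
    """True if the whole value fits the ABNF as interpreted above."""
    return ALLOWED_CHARACTERS.fullmatch(value) is not None


if __name__ == "__main__":
    samples = ["[a-zA-Z_\\-]", "[^*+]", "[ !\u2013~]", "[ -\u00fe]",
               "[f-", "[]", "[a-d[^c]]"]
    for sample in samples:
        print(repr(sample), conforms(sample))
```

Under this reading the four test suite values are accepted, while the unterminated "[f-", the empty "[]" and the nested "[a-d[^c]]" are rejected.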
>
>
> On 06.04.13 19:18, Pablo Nieto Caride wrote:
>> Hi Felix, all,
>>
>>
>> On Apr 6, 2013, at 1:25 PM, Felix Sasaki <fsasaki@w3.org> wrote:
>>
>>> Hi Pablo,
>>>
>>> sorry for the effort, but to move this forward, we need to make
>>> sure that at least the test suite regex examples work.
>>>
>>> I checked
>>> https://github.com/finnle/ITS-2.0-Testsuite/tree/master/its2.0/inputdata/allowedcharacters/xml
>>>
>>> by replacing in my local copy of the test suite
>>> https://github.com/finnle/ITS-2.0-Testsuite/blob/master/its2.0/schema/its20-types.rng
>>>
>>> this part
>>> <define name="its-allowedCharacters.type">
>>> <data type="string"></data>
>>> </define>
>>> with this, that is, with your regex for XML validation inside the
>>> "pattern" element:
>>> <data type="string">
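>>> <param name="pattern">^((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([&#x9;&#xA;&#xD;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)?([&#x9;&#xA;&#xD;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)*\]))*$</param>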
>>> </data>
>>>
>>> That gave me validation errors like this one:
>>>
>>> [jing]
>>> /its2.0/inputdata/allowedcharacters/xml/allowedcharacters7xmlrules.xml:3:100:
>>> error: Bad value "[^*+]" for attribute "allowedCharacters" on
>>> element "allowedCharactersRule" from namespace
>>> "http://www.w3.org/2005/11/its".
>>>
>>> Could you change in your local copy of the test suite the "param"
>>> element with your regex so that the validation for all test suite
>>> files for allowed characters
>>> https://github.com/finnle/ITS-2.0-Testsuite/tree/master/its2.0/inputdata/allowedcharacters/
>>>
>>> works?
>>>
>>> FYI, the content is an XML Schema regular expression, so your XML
>>> version for validation should work finally, I think.
>>>
>>> Again, sorry for the effort, but it would be great to have this done
>>> before the next publication, that is by Thursday next week. Would
>>> that work for you?
>>>
>> I am doing some testing with the files you sent me to see how XSD
>> works with regexes, and I'm seeing weird things, like problems with ^
>> and $ to set the beginning and end of the regex. I'm still working on it.
>>
>> I will do as you say and change my local copy of the schema to
>> validate the Test Suite files.
>>
>> Yes I think it'll work for me, there is time and I think I'm close to
>> the solution. Sorry but it got more complicated than I initially
>> expected.
>>
>> Cheers,
>> Pablo.
>>
>>> Best,
>>>
>>> Felix
>>>
>>> On 05.04.13 15:39, Pablo Nieto Caride wrote:
>>>> Hi Felix,
>>>>
>>>> Yes, I tried the Allowed Characters example from the Test Suite
>>>> beforehand to make sure that the regex worked, and [a-zA-Z_\-]
>>>> works for me on my system. Anyway, I'll try what you suggest and
>>>> get back to you as soon as I have the results.
>>>>
>>>> Cheers,
>>>> Pablo.
>>>> __________________________________
>>>>
>>>> Hi Pablo, all,
>>>>
>>>> On 05.04.13 11:24, Pablo Nieto Caride wrote:
>>>>> Hi all,
>>>>>
>>>>> I have completed the regex. In the end I decided to restrict it to
>>>>> Plane 0 (Basic Multilingual Plane, 0000-FFFF) because I think that
>>>>> is sufficient and the regex would otherwise be very complex.
>>>>> Besides, Shaun didn't actually limit it to Plane 1 (Supplementary
>>>>> Multilingual Plane, 10000–1FFFF) but to Planes 15-16 (10FFFF),
>>>>> which is too much. As I understand it, the regex covers the basics
>>>>> (now including escapes of [, ], ^ and -), does not match incorrect
>>>>> regexes such as "[f-", supports the greedy and lazy wildcards (this
>>>>> is not really necessary), and does not support nested character
>>>>> classes (do we need them? They are rarely used in general). Please
>>>>> test it:
>>>>> 1) Here is the proposed regular expression escaped with XML
>>>>> numeric character entities, as if it were put into an XML document:
>>>>> ^((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([&#x9;&#xA;&#xD;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)?([&#x9;&#xA;&#xD;&#x20;-&#x2C;&#x2E;-&#x5A;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)*\]))*$
>>>> I tried that with [a-zA-Z_\-]
>>>> but got a validation error. Could you check a few examples from
>>>> https://github.com/finnle/ITS-2.0-Testsuite/blob/master/its2.0/inputdata/allowedcharacters/html/
>>>> to make sure that the regex works? E.g. by creating a schema like
>>>> the attached one and checking with the regex?
>>>>
>>>>
>>>> Best,
>>>>
>>>> Felix
>>>>> 2) Here it is with \x{}, for Perl/PCRE only:
>>>>> ^((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([\x{09}\x{0A}\x{0D}\x{20}-\x{2C}\x{2E}-\x{5A}\x{5F}-\x{D7FF}\x{E000}-\x{FFFD}]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)?([\x{09}\x{0A}\x{0D}\x{20}-\x{2C}\x{2E}-\x{5A}\x{5F}-\x{D7FF}\x{E000}-\x{FFFD}]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)*\]))*$
>>>>>
>>>>>
>>>>> 3) And here is a regular expression that matches a subset of our
>>>>> subset, limited to Plane 0, with the \u escape:
>>>>> ^((\.((\*|\+)|(\*\?|\+\?))?)|(\[\^?(([\u0009\u000A\u000D\u0020-\u002C\u002E-\u005A\u005F-\uD7FF\uE000-\uFFFD]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+(-)?([\u0009\u000A\u000D\u0020-\u002C\u002E-\u005A\u005F-\uD7FF\uE000-\uFFFD]|(\\\[)|(\\\])|(\\\^)|(\\\-)|(\\))+)*\]))*$
>>>>>
>>>>>
>>>>> 4) And remember, the backslashes and escaped backslashes are
>>>>> significant to the regular expression engine. If you're putting
>>>>> that into a string in a language like Java or C#, you need to
>>>>> escape the escapes:
>>>>> re = new Regex("^((\\.((\\*|\\+)|(\\*\\?|\\+\\?))?)|(\\[\\^?(([\\u0009\\u000A\\u000D\\u0020-\\u002C\\u002E-\\u005A\\u005F-\\uD7FF\\uE000-\\uFFFD]|(\\\\\\[)|(\\\\\\])|(\\\\\\^)|(\\\\\\-)|(\\\\))+(-)?([\\u0009\\u000A\\u000D\\u0020-\\u002C\\u002E-\\u005A\\u005F-\\uD7FF\\uE000-\\uFFFD]|(\\\\\\[)|(\\\\\\])|(\\\\\\^)|(\\\\\\-)|(\\\\))+)*\\]))*$");
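The four variants above differ only in how the character class and the backslashes are written. As a sketch of that relationship, the snippet below renders the same code point ranges (taken from the Perl/PCRE variant) in the three notations and shows the backslash doubling needed for a Java or C# string literal. The helper names are made up for illustration, and as_string_literal only handles backslashes (it assumes the regex contains no double quotes).

```python
# Code point ranges of the character class, as listed in the \x{} variant.
RANGES = [(0x09, 0x09), (0x0A, 0x0A), (0x0D, 0x0D), (0x20, 0x2C),
          (0x2E, 0x5A), (0x5F, 0xD7FF), (0xE000, 0xFFFD)]


def render_class(fmt) -> str:
    """Render RANGES as one bracketed class using the given code point formatter."""
    parts = []
    for low, high in RANGES:
        parts.append(fmt(low) if low == high else f"{fmt(low)}-{fmt(high)}")
    return "[" + "".join(parts) + "]"


# 1) XML numeric character references, 2) Perl/PCRE \x{}, 3) \u escapes.
xml_class = render_class(lambda cp: f"&#x{cp:X};")
pcre_class = render_class(lambda cp: f"\\x{{{cp:02X}}}")
u_class = render_class(lambda cp: f"\\u{cp:04X}")


def as_string_literal(regex: str) -> str:
    """4) Double every backslash and add quotes, as a Java/C# literal needs."""
    return '"' + regex.replace("\\", "\\\\") + '"'


if __name__ == "__main__":
    print(xml_class)
    print(pcre_class)
    print(u_class)
    print(as_string_literal(r"(\\\[)"))
```

The last line turns the three-backslash fragment (\\\[) into the six-backslash form (\\\\\\[) that appears in the string literal above.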
>>>>>
>>>>> I'll proceed now to draft text explaining importance of Unicode
>>>>> normalization and best practices, that's Action-430.
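To illustrate why normalization matters for an allowed-characters check, the sketch below tests the same visible word in composed and decomposed form against a BMP range. The pattern [a-ÿ] and the helper is_allowed are hypothetical examples, not taken from the specification or the test suite.

```python
import re
import unicodedata

# Hypothetical allowedCharacters value [a-ÿ], applied to the whole string.
ALLOWED = re.compile(r"[a-\u00ff]+")

composed = "caf\u00e9"     # é as the single code point U+00E9
decomposed = "cafe\u0301"  # e followed by COMBINING ACUTE ACCENT (U+0301)


def is_allowed(text: str, form: str = "") -> bool:
    """Check text against ALLOWED, optionally normalizing it first."""
    if form:
        text = unicodedata.normalize(form, text)
    return ALLOWED.fullmatch(text) is not None


if __name__ == "__main__":
    print(is_allowed(composed))           # precomposed é is inside the range
    print(is_allowed(decomposed))         # U+0301 is outside the range
    print(is_allowed(decomposed, "NFC"))  # normalizing first restores agreement
```

The two strings render identically, yet only the composed one passes unless the text is normalized to NFC before the check.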
>>>>>
>>>>> Cheers,
>>>>> Pablo.
>>>>> __________________________________
>>>>>
>>>>> Hi Jirka,
>>>>>
>>>>> It should not match invalid expressions since it only supports
>>>>> character classes, ranges and negations, but it still needs a bit
>>>>> of polishing regarding escapes. I don't think we need a BNF grammar,
>>>>> but it's not mine to decide; I am just doing what I'm supposed to.
>>>>>
>>>>> Cheers,
>>>>> Pablo.
>>>>> __________________________________
>>>>>
>>>>> On 4.4.2013 17:12, Pablo Nieto Caride wrote:
>>>>>> Implementers and anyone else who is interested, please give
>>>>>> feedback if necessary so I can move forward and evolve the regex.
>>>>> Hi,
>>>>>
>>>>> since such complex regular expressions are mostly write-only (it's
>>>>> very hard to understand what they are trying to match) I'm not
>>>>> sure what's the point of having this complex regular expression
>>>>> for checking our regular expression syntax subset. I haven't tried
>>>>> to get deep understanding of this expression but I bet it will
>>>>> match even invalid expressions. If we want to have rigorous
>>>>> definition of our RE syntax we should provide its definition as
>>>>> grammar written in BNF.
>>>>>
>>>>> Jirka
>>>>>
>>>>> --
>>>>> ------------------------------------------------------------------
>>>>> Jirka Kosek e-mail: jirka@kosek.cz http://xmlguru.cz
>>>>> ------------------------------------------------------------------
>>>>> Professional XML consulting and training services
>>>>> DocBook customization, custom XSLT/XSL-FO document processing
>>>>> ------------------------------------------------------------------
>>>>> OASIS DocBook TC member, W3C Invited Expert, ISO JTC1/SC34 rep.
>>>>> ------------------------------------------------------------------
>>>>> Bringing you XML Prague conference http://xmlprague.cz
>>>>> ------------------------------------------------------------------
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>
>
>
Received on Monday, 8 April 2013 10:45:38 UTC