[Issue-67] [Action-385] Work on regex for validating regex subset proposal

From: Pablo Nieto Caride <pablo.nieto@linguaserve.com>
Date: Thu, 4 Apr 2013 17:12:17 +0200
To: <public-multilingualweb-lt@w3.org>
Message-ID: <05a401ce3146$cb8d9b50$62a8d1f0$@linguaserve.com>
Hi all,

 

I made some headway on action 385. Just to summarise:

Yves raised an issue
(https://www.w3.org/International/multilingualweb/lt/track/issues/67) on
Allowed Characters
(http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#allowedchars),
stating that using the XML Schema character class regular expression syntax
reduces interoperability
(http://lists.w3.org/Archives/Public/public-multilingualweb-lt-comments/2013Jan/0000.html).

 

Shaun did some research for Action-385
(https://www.w3.org/International/multilingualweb/lt/track/actions/385) and
came up with a small subset of common regular expression syntax supported by
most engines
(http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2013Jan/0180.html),
which is more or less the one Yves had suggested before:

1.	character classes [abc]  [a-zA-Z_\-]
2.	ranges [a-c]
3.	negations [^abc]
4.	"^" and "]" must never appear unless backslash-escaped
5.	"-" may be backslash-escaped
6.	escape sequences "\n", "\r", "\t", "\d", and "\D"
7.	literal "\" is escaped as "\\"
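A minimal sketch of the seven points above, assuming Python's `re` module as the engine (Python is not one of the engines named in this thread, and the probe strings are made up for illustration). Each pattern is paired with one string it should match and one it should not:

```python
import re

# pattern -> (string expected to match, string expected not to match)
SAMPLES = {
    r"[abc]": ("b", "d"),            # 1. character class
    r"[a-zA-Z_\-]": ("-", "1"),      # 1./5. class with an escaped hyphen
    r"[a-c]": ("b", "d"),            # 2. range
    r"[^abc]": ("d", "a"),           # 3. negation
    r"[\^\]]": ("]", "a"),           # 4. escaped "^" and "]"
    r"[\n\r\t]": ("\t", " "),        # 6. control-character escapes
    r"[\d]": ("7", "x"),             # 6. digit
    r"[\D]": ("x", "7"),             # 6. non-digit
    r"[\\]": ("\\", "/"),            # 7. literal backslash
}

def check(samples):
    """Return {pattern: (hit_matched, miss_rejected)} for each sample."""
    results = {}
    for pattern, (hit, miss) in samples.items():
        compiled = re.compile(pattern)
        results[pattern] = (
            compiled.fullmatch(hit) is not None,
            compiled.fullmatch(miss) is None,
        )
    return results

for pattern, outcome in check(SAMPLES).items():
    print(pattern, outcome)
```

This only shows that one engine accepts the subset; it says nothing about the other engines the subset is meant to cover.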

 

Subsequently he developed a regex which Felix corrected, that is:

^(\.|\[\^?-?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5B;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;&#x10000;-&#x10FFFF;]|\\n|\\r|\\t|\\]|\\^|\\-|\\\\)(-([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5B;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;&#x10000;-&#x10FFFF;]|\\n|\\r|\\t|\\]|\\^|\\-|\\\\))?)+-?\])?$

but it doesn’t seem to work.
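One way to probe the expression above is to decode the XML numeric character references and run it in an engine. The sketch below assumes Python's `re`; the helper names (`decode_char_refs`, `accepted`) and the probe strings are hypothetical. In this engine an unescaped `^` outside a character class is an anchor, so the `\\^` alternative can never match and `[\^]` is rejected. Other engines may treat `^` differently, so this is only one possible cause and not necessarily the failure seen here.

```python
import re

# Character class and escape alternatives, copied from the expression above
# and still written with XML numeric character references.
CLASS = ("[&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5B;&#x5F;-&#xD7FF;"
         "&#xE000;-&#xFFFD;&#x10000;-&#x10FFFF;]")
ESCAPES = r"\\n|\\r|\\t|\\]|\\^|\\-|\\\\"
ITEM = "(" + CLASS + "|" + ESCAPES + ")"
XML_FORM = r"^(\.|\[\^?-?(" + ITEM + "(-" + ITEM + r")?)+-?\])?$"

def decode_char_refs(text):
    """Replace &#xHHHH; references with the characters they denote."""
    return re.sub(r"&#x([0-9A-Fa-f]+);",
                  lambda m: chr(int(m.group(1), 16)), text)

VALIDATOR = re.compile(decode_char_refs(XML_FORM))

def accepted(expr):
    """True if expr is accepted by the decoded validation regex."""
    return VALIDATOR.fullmatch(expr) is not None

for probe in ["", ".", "[abc]", "[a-c]", r"[\n]", r"[\-]", r"[\^]"]:
    print(repr(probe), accepted(probe))
```

Note that the trailing `?` on the outer group also lets the empty string through in this check.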

 

Since I don’t quite understand the regex structure Shaun chose, I took the
liberty of adapting it a little. I think it is now simpler, and it supports
the subset (character classes, ranges, negations, etc.) plus greedy and lazy
operators (which I can drop if they’re not needed, but I believe most engines
support them and they are usually helpful). I’m still working on it because
it needs more work; for instance, points 4, 5 and 7 work, but without
limitations.

So, just as Shaun did, here is the proposed regular expression escaped with
XML numeric character references, as if it were put into an XML document:

^(\.(\*|\+)?\??|\[\^?(([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5B;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;&#x10000;-&#x10FFFF;])*-?([&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5B;&#x5F;-&#xD7FF;&#xE000;-&#xFFFD;&#x10000;-&#x10FFFF;])+)+-?\])$
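A minimal sketch for trying out the XML-escaped form, assuming Python's `re` as the engine. The helper names (`decode_char_refs`, `accepted`) and the probe strings are hypothetical and not part of the proposal. The references are decoded first, as an XML parser would do, and the result is compiled as an ordinary pattern:

```python
import re

# Character class copied from the proposed expression, still XML-escaped.
CLASS = ("[&#x09;&#x0A;&#x0D;&#x20;-&#x2C;&#x2E;-&#x5B;&#x5F;-&#xD7FF;"
         "&#xE000;-&#xFFFD;&#x10000;-&#x10FFFF;]")
XML_FORM = (r"^(\.(\*|\+)?\??|\[\^?((" + CLASS + r")*-?(" + CLASS
            + r")+)+-?\])$")

def decode_char_refs(text):
    """Replace &#xHHHH; references with the characters they denote."""
    return re.sub(r"&#x([0-9A-Fa-f]+);",
                  lambda m: chr(int(m.group(1), 16)), text)

VALIDATOR = re.compile(decode_char_refs(XML_FORM))

def accepted(expr):
    """True if expr is accepted by the decoded validation regex."""
    return VALIDATOR.fullmatch(expr) is not None

PROBES = [".", ".*", ".+?", "[abc]", "[a-c]", "[^abc]", r"[\d]", "[]"]
for probe in PROBES:
    print(repr(probe), accepted(probe))
```

In this check the dot forms, plain classes, ranges and negations are accepted, while `[\d]` is rejected: the decoded class excludes U+005C, so a backslash cannot appear inside the brackets at all.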

 

And here is a regular expression that matches a subset of our subset,
limited to Plane 1, with the \u escape (I tested it with PHP and JavaScript
and it works):

^(\.(\*|\+)?\??|\[\^?(([\u0009\u000A\u000D\u0020-\u002C\u002E-\u005B\u005F-\uD7FF\uE000-\uFFFD\u10000-\u10FFFF])*-?([\u0009\u000A\u000D\u0020-\u002C\u002E-\u005B\u005F-\uD7FF\uE000-\uFFFD\u10000-\u10FFFF])+)+-?\])$
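A minimal sketch for checking the `\u` form, assuming Python's `re` (not one of the engines tested above; `accepted` and the probe strings are hypothetical). In this engine `\u` takes exactly four hex digits, so `\u10000-\u10FFFF` is read as U+1000, then the range `0-\u10FF`, then two literal `F` characters. That range spans U+0030 to U+10FF and therefore puts `\`, `]` and `^` back into the class. Other engines may parse it differently.

```python
import re

# Character class copied from the expression above.
CLASS_U = (r"[\u0009\u000A\u000D\u0020-\u002C\u002E-\u005B\u005F-\uD7FF"
           r"\uE000-\uFFFD\u10000-\u10FFFF]")
U_FORM = (r"^(\.(\*|\+)?\??|\[\^?((" + CLASS_U + r")*-?(" + CLASS_U
          + r")+)+-?\])$")
VALIDATOR = re.compile(U_FORM)

def accepted(expr):
    """True if expr is accepted by the \\u form of the validation regex."""
    return VALIDATOR.fullmatch(expr) is not None

# "\u10000" is read as U+1000 followed by the digit "0",
# so the class [\u10000] matches a plain "0".
four_digit_only = re.fullmatch(r"[\u10000]", "0") is not None

for probe in ["[a-c]", r"[\d]", "[\U0001F600]"]:
    print(repr(probe), accepted(probe))
```

Under these assumptions `[\d]` is accepted because the backslash is treated as an ordinary class member, and a class containing a character above U+FFFF (here U+1F600) is rejected.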

 

Implementers and anyone else who is interested: please give feedback so I
can move forward and evolve the regex.

 

Cheers,

__________________________________

Pablo Nieto Caride

Technical Dept. / R&D&i

Linguaserve Internacionalización de Servicios, S.A.

__________________________________
Received on Thursday, 4 April 2013 15:12:53 UTC
