RE: [ACTION-160] (related to [ACTION-135] too) Summarize specialRequirements from Yves Savourel on 2012-07-11 (public-multilingualweb-lt@w3.org from July 2012)

From: Yves Savourel <ysavourel@enlaso.com>
Date: Wed, 11 Jul 2012 11:07:02 +0200
To: "'Arle Lommel'" <arle.lommel@dfki.de>, "'Multilingual Web LT Public List'" <public-multilingualweb-lt@w3.org>
Cc: "'Fredrik Estreen'" <fredrik.estreen@lionbridge.com>
Message-ID: <assp.05397d4f1c.assp.05393c6b5a.002f01cd5f44$89bdbae0$9d3930a0$@com>
Hi Arle,

I like this modest proposal.

Providing [abc], [^abc] and [a-z] in addition to the enumeration would help.
And those are standard notations.

Here is a list of many regex features and how they are compatible.
http://www.regular-expressions.info/refflavors.html

The only issues I would see are:

- the ASCII vs Unicode for things like \w \s, etc.
We would probably have to decide one way or the other.

- the \x{NNN} and \p{...}
Which are not always supported in all engines.
But maybe we can limit the number of engines to the main ones: Perl5, Java, .NET, Python, XML.
Then things can be simple.
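
To make the ASCII vs Unicode question concrete, here is a minimal sketch using Python's re module as one example engine (the choice of engine and the helper name classify are assumptions for illustration, not part of the proposal). The same \w and \s give different answers depending on whether the engine is told to use ASCII-only semantics:

```python
import re

def classify(ch):
    """Report how \\w and \\s treat one character under Unicode and ASCII semantics."""
    return {
        "w_unicode": bool(re.fullmatch(r"\w", ch)),            # default for str patterns
        "w_ascii": bool(re.fullmatch(r"\w", ch, re.ASCII)),    # ASCII-only interpretation
        "s_unicode": bool(re.fullmatch(r"\s", ch)),
        "s_ascii": bool(re.fullmatch(r"\s", ch, re.ASCII)),
    }

# ASCII letter, Hiragana letter, ideographic space (U+3000)
for ch in ["a", "あ", "\u3000"]:
    print(f"U+{ord(ch):04X}", classify(ch))
```

Under Unicode semantics the Hiragana letter counts as a word character and U+3000 as whitespace; under ASCII semantics neither does, which is why the specification would need to pick one interpretation.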

-ys


From: Arle Lommel [mailto:arle.lommel@dfki.de] 
Sent: Wednesday, July 11, 2012 10:30 AM
To: Multilingual Web LT Public List
Cc: Fredrik Estreen
Subject: Re: [ACTION-160] (related to [ACTION-135] too) Summarize specialRequirements

Hi all,

(If you don't want to read about why I think something more than simple enumeration is vital, cut to the section called A MODEST PROPOSAL below.)

Michael writes the following:

we would definitely opt for a solution that would at least allow us to enumerate forbidden characters (using Unicode pointers as you suggested)

Full enumeration could be a real pain if your set starts looking like this:

<span its-forbiddenchars="あぃいぅうぇえぉおかがきぎくぐけげこごさざしじすずせぜそぞただちぢっつづてでとどなにぬねのはばぱひびぴふぶぷへべぺほぼぽまみむめもゃやゅゆょよらりるれろゎわゐゑをんゔゕゖ゛゜ゝゞゟ">Some Japanese content here where Hiragana is not allowed</span>

(Now imagine you need to exclude all CJK glyphs…)

So it's clear to me that at least some regex syntax would be needed: the part that allows range selector and inverse range selector. That would allow you to have something like the following:

<span its-forbiddenchars="[あ-ゟ]">Some Japanese content here where Hiragana is not allowed</span>

Or the following, where only Katakana and spaces are allowed (i.e., anything NOT in the range of Katakana + space is NOT allowed):

<span its-forbiddenchars="[^ア-ヿ ]">アレロ ロメル</span>
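
As a rough illustration of how a checker might apply these two values, here is a minimal Python sketch. It assumes the attribute value can be handed to a regex engine as a single character class; the function name forbidden_hits and the sample strings are made up for the example:

```python
import re

def forbidden_hits(char_class, text):
    """Return every character of `text` that matches the forbidden character class."""
    return re.findall(char_class, text)

hiragana_forbidden = "[あ-ゟ]"          # range U+3042..U+309F
only_katakana_and_space = "[^ア-ヿ ]"   # anything outside U+30A2..U+30FF and space

print(forbidden_hits(hiragana_forbidden, "カタカナ と ひらがな"))
# → ['と', 'ひ', 'ら', 'が', 'な']
print(forbidden_hits(only_katakana_and_space, "アレロ ロメル"))
# → []
```

An empty result means the content passes; any returned characters are the violations to report.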

So at the very least we need to support the range selectors. Even better would be to support limited Unicode regular expressions like this as well:

<span its-forbiddenchars="[^\p{Han}\p{Hiragana}\p{Katakana}\p{InCJK_Symbols_and_Punctuation}]">おかがきぎ</span>

I am very aware that allowing arbitrary regular expressions, regardless of syntax, is asking for hurt since they can go beyond defining forbidden characters to defining forbidden patterns (which would be vastly more complex from a support standpoint). For example, if we go beyond allowing character enumerations and ranges, we (theoretically at least) have to deal with nastiness like this:

<span its-forbiddenchars="(?#Calandar from January 1st 1 A.D to December 31, 9999 )(?# in yyyy-mm-dd format )(?!(?:1582\D10\D(?:0?[5-9]|1[0-4]))|(?#Missing days from 1582 )(?:1752\D0?9\D(?:0?[3-9]|1[0-3]))(?#or Missing days from 1752 )(?# both sets of missing days should not be in the same calendar so remove one or the other))(?n:^(?=\d)(?# the character at the beginning a the string must be a digit )((?'year'\d{4})(?'sep'[-./])(?'month'0?[1-9]|1[012])\k'sep'(?'day'(?<!(?:0?[469]|11).)31|(?<!0?2.)30|2[0-8]|1\d|0?[1-9]|(?# if feb 29th check for valid leap year )(?:(?<=(?!(?#exclude these years from leap year pattern ) 000[04](?#No year 0 and no leap year in year 4 )|(?:(?:1[^0-6]|[2468][^048]|[3579][^26])00)(?# centurial years > 1500 not evenly divisible by 400 are not leap year))(?:(?:\d\d)(?# century)(?:[02468][048]|[13579][26])(?#leap years))\k'sep'(?:0?2)\k'sep')|(?# else if not Feb 29 )(?<!\k'sep'(?:0?2)\k'sep')(?# and day not Feb 30 or 31 ))29)(?(?=\x20\d)\x20|$))?(?# if there is a space followed by a digit check for time )(?<time>((?# 12 hour format )(0?[1-9]|1[012])(?# hours )(:[0-5]\d){0,2}(?# optional minutes and seconds )(?i:\x20[AP]M)(?# required AM or PM ))|(?# 24 hour format )([01]\d|2[0-3])(?#hours )(:[0-5]\d){1,2})(?#required minutes optional seconds )?$)">Anything but a properly formatted date</span>

(I found that nice piece online as an example of full date parsing and validation with regex. It is the craziest piece of regex I've come across; unfortunately I can't actually get it to parse in my tools, so it must use a non-PCRE flavor.)

A MODEST PROPOSAL
So we do need to constrain this, but we also need to go beyond simple enumeration, so how about the following proposal for the attribute value of its-forbiddenchars?

• operators: [, ], -, and ^ (i.e., standard range operators). Actually, I suspect that the [ and ] should be optional and implied if not present, since without them you are already moving beyond defining ranges to defining patterns: best practice would be to include them to make the semantics clear, but if they are missing, just imply their existence.
o Note that because we are discussing forbidden characters, the ^ operator would actually mean to allow only the characters in the range, a reversal of its normal semantics since we are effectively stating a double negation (i.e., NOT NOT these characters).
• special characters: \r, \n, \t, \f, \c (for control characters), \x and \x{NNNN}, \\, \(, \), \[, \], \{, \}
• character classes: \s, \S, \w, \W, \d, \D, \p{} (UNICODE classes)
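
The "implied brackets" rule and the double-negation reading of ^ can be sketched as follows. This is only an illustration in Python under my own assumptions: the names to_char_class and violations are hypothetical, and the sketch does not check the escape and class lists above (Python's re does not accept every one of them in any case).

```python
import re

def _is_bracketed(value):
    """True if value starts with [ and ends with an unescaped ]."""
    if len(value) < 2 or not (value.startswith("[") and value.endswith("]")):
        return False
    before_last = value[:-1]
    trailing_backslashes = len(before_last) - len(before_last.rstrip("\\"))
    return trailing_backslashes % 2 == 0

def to_char_class(value):
    """Return value as exactly one bracketed character class, adding implied brackets."""
    if not value:
        raise ValueError("empty its-forbiddenchars value")
    body = value[1:-1] if _is_bracketed(value) else value
    i = 0
    while i < len(body):
        if body[i] == "\\":
            if i + 1 >= len(body):
                raise ValueError("dangling backslash")
            i += 2                      # skip the escaped character
            continue
        if body[i] in "[]":
            # An unescaped bracket means more than one class, i.e. a pattern.
            raise ValueError("only a single character class is allowed")
        i += 1
    return "[" + body + "]"

def violations(value, text):
    """Characters of text that the attribute value forbids."""
    return re.findall(to_char_class(value), text)

print(violations("a-c", "abcd"))    # brackets implied: a, b and c are forbidden
print(violations("^a-c", "abcd"))   # ^ flips it: only a, b and c are allowed
```

Rejecting unescaped brackets inside the value is what keeps the attribute limited to forbidden characters rather than forbidden patterns.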

As you may note, just to make transfer of existing character classes easier, I would recommend that any characters that would have to be backslash-escaped in a normal regex environment be escaped here too, along with any single or double quote marks that would otherwise match the attribute value delimiters. E.g., you would need to have something like the following to include a quote:

<span its-forbiddenchars="[\"']">Span where no quote marks are allowed</span>
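
One caveat worth checking: HTML and XML parsers do not treat a backslash as an escape inside attribute values, so in serialized markup the double quote would probably need to be written as a character reference instead. A minimal sketch, assuming the producer is written in Python and uses the standard html module:

```python
import html

# Regex-level value: a character class containing both quote marks.
char_class = "[\"']"

# Encode it for a double-quoted attribute; quotes become character references.
attr_value = html.escape(char_class, quote=True)
markup = f'<span its-forbiddenchars="{attr_value}">Span where no quote marks are allowed</span>'
print(markup)

# A consumer recovers the original class after decoding the references.
decoded = html.unescape(attr_value)
print(decoded == char_class)
```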

No idea if this would meet needs or not (I'm sure it would need refinement), but I thought I would throw out what it took me about 20 minutes to cook up.

Hope that helps,

Arle

On Jul 10, 2012, at 18:16 , Michael Kruppa wrote:


Hi Felix, all,

from our rather technical point of view, the forbidden characters are highly relevant, if not to say absolutely necessary, in order to avoid certain problems. If we cannot agree on an approach based on regular expressions due to the inherent complexity, we would definitely opt for a solution that would at least allow us to enumerate forbidden characters (using Unicode pointers as you suggested).

For us, the regex solution would be of potential interest, but the simple enumeration approach would suffice for the current purpose we have in mind.

Best

Micha
Received on Wednesday, 11 July 2012 09:07:45 UTC