Re: [ACTION-160] (related to [ACTION-135] too) Summarize specialRequirements from Felix Sasaki on 2012-07-11 (public-multilingualweb-lt@w3.org from July 2012)

From: Felix Sasaki <fsasaki@w3.org>
Date: Wed, 11 Jul 2012 11:19:19 +0200
To: Michael Kruppa <Michael.Kruppa@cocomore.com>
Cc: Yves Savourel <ysavourel@enlaso.com>, Arle Lommel <arle.lommel@dfki.de>, Multilingual Web LT Public List <public-multilingualweb-lt@w3.org>, Fredrik Estreen <fredrik.estreen@lionbridge.com>
Message-ID: <CAL58czq0k9qBU0ymaeuaxioAguTnbm6Ye+yoXNjMj4rZiG6Ajg@mail.gmail.com>
We also had input from Microsoft (Ian) that they are working on a solution
involving regex. So if "enumeration only" works for Michael's /
Cocomore's scenario, I would argue against adopting even a simple regex
approach, to avoid too many solutions in the same problem space.

Felix

2012/7/11 Michael Kruppa <Michael.Kruppa@cocomore.com>

> Hi Yves, Arle, all,
>
> I totally agree that Arle's proposal is very reasonable. If we can clarify
> the points Yves made, this would be a good solution from our point of view.
>
> I would just like to clarify that I meant the enumeration approach would
> be sufficient for us to avoid data storage and HTML integration
> problems.
> It is, of course, in no way sufficient for the examples Arle has given.
>
> Cheers
>
> Micha
>
> ________________________________________
> Dr. Michael Kruppa, Senior IT-Consultant
> Tel.: +49 69 972 69 189 Fax: +49 69 972 69 204; E-Mail:
> michael.kruppa@cocomore.com
> Cocomore AG, Gutleutstraße 30, D-60329 Frankfurt
> Internet: http://www.cocomore.de Facebook:
> http://www.facebook.com/cocomore Google+: http://plus.cocomore.de
> Cocomore is an active member of the World Wide Web Consortium (W3C) and
> the Bundesverband Digitale Wirtschaft (BVDW)
> Executive Board: Dr. Hans-Ulrich von Freyberg (Chair), Dr. Jens Fricke, Marc
> Kutschera, Chair of the Supervisory Board: Martin Velasco, Registered
> office: Frankfurt/Main, District Court Frankfurt am Main, HRB 51114
>
>
>
> -----Original Message-----
> From: Yves Savourel [mailto:ysavourel@enlaso.com]
> Sent: Wednesday, July 11, 2012 11:07
> To: 'Arle Lommel'; 'Multilingual Web LT Public List'
> Cc: 'Fredrik Estreen'
> Subject: RE: [ACTION-160] (related to [ACTION-135] too) Summarize
> specialRequirements
>
> Hi Arle,
>
> I like this modest proposal.
>
> Providing [abc], [^abc] and [a-z] in addition to the enumeration would
> help.
> And those are standard notations:
>
> Here is a list of many regex features and how they are compatible.
> http://www.regular-expressions.info/refflavors.html
>
> The only issues I would see are:
>
> - ASCII vs. Unicode semantics for things like \w, \s, etc.
> We would probably have to decide one way or the other.
>
> - \x{NNN} and \p{...}
> Which are not supported by all engines.
> But maybe we can limit the number of engines to the main ones: Perl5,
> Java, .NET, Python, XML.
> Then things can be simple.
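
To make the two issues above concrete, here is a minimal Python sketch. Python is only one of the engines listed, the sample string and the helper name `compiles` are made up for illustration, and other engines will behave differently. It shows that \w, \d and \s match different characters in Unicode and ASCII mode, and that Python's standard `re` module rejects \p{...} and \x{NNN}.

```python
import re

# 'a', Hiragana 'あ', fullwidth digit '５', ideographic space (illustrative sample)
SAMPLE = "a\u3042\uff15\u3000"

# Unicode mode is the default for str patterns in Python 3.
unicode_word = re.findall(r"\w", SAMPLE)
unicode_digit = re.findall(r"\d", SAMPLE)
unicode_space = re.findall(r"\s", SAMPLE)

# re.ASCII restricts the shorthand classes to ASCII characters.
ascii_word = re.findall(r"\w", SAMPLE, re.ASCII)
ascii_digit = re.findall(r"\d", SAMPLE, re.ASCII)
ascii_space = re.findall(r"\s", SAMPLE, re.ASCII)


def compiles(pattern: str) -> bool:
    """Hypothetical helper: report whether this engine accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


print(unicode_word, ascii_word)
print(compiles("[あ-ゟ]"), compiles(r"[\p{Han}]"), compiles(r"[\x{3042}]"))
```

So whichever way the ASCII/Unicode question is decided, an implementation on an engine like this one would have to translate or reject the \p{...} and \x{NNN} forms itself.
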
>
> -ys
>
>
> From: Arle Lommel [mailto:arle.lommel@dfki.de]
> Sent: Wednesday, July 11, 2012 10:30 AM
> To: Multilingual Web LT Public List
> Cc: Fredrik Estreen
> Subject: Re: [ACTION-160] (related to [ACTION-135] too) Summarize
> specialRequirements
>
> Hi all,
>
> (If you don't want to read about why I think something more than simple
> enumeration is vital, cut to the section called A MODEST PROPOSAL below.)
>
> Michael writes the following:
>
> we would definitely opt for a solution that would at least allow us to
> enumerate forbidden characters (using Unicode pointers as you suggested)
>
> Full enumeration could be a real pain if your set starts looking like this:
>
> <span
> its-forbiddenchars="あぃいぅうぇえぉおかがきぎくぐけげこごさざしじすずせぜそぞただちぢっつづてでとどなにぬねのはばぱひびぴふぶぷへべぺほぼぽまみむめもゃやゅゆょよらりるれろゎわゐゑをんゔゕゖ゛゜ゝゞゟ">Some
> Japanese content here where Hiragana is not allowed</span>
>
> (Now imagine you need to exclude all CJK glyphs…)
>
> So it's clear to me that at least some regex syntax would be needed: the
> part that allows range selector and inverse range selector. That would
> allow you to have something like the following:
>
> <span its-forbiddenchars="[あ-ゟ]">Some Japanese content here where Hiragana
> is not allowed</span>
>
> Or the following, where only Katakana and spaces are allowed (i.e.,
> anything NOT in the range of Katakana + space is NOT allowed):
>
> <span its-forbiddenchars="[^ア-ヿ ]">アレロ ロメル</span>
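
As a rough illustration of how a checker might apply the two range examples above, here is a minimal Python sketch. The function name `find_forbidden` and the sample strings are assumptions for illustration only; nothing here is part of any ITS draft. It treats the attribute value as a regex character class and lists every character of the content that the class matches, i.e. every forbidden character.

```python
import re
from typing import List

HIRAGANA = "[あ-ゟ]"        # forbid the Hiragana range from the first example
ONLY_KATAKANA = "[^ア-ヿ ]"  # forbid everything except Katakana and space


def find_forbidden(text: str, forbidden_class: str) -> List[str]:
    """Return each character of text that the forbidden class matches."""
    return re.findall(forbidden_class, text)


# Hiragana characters are reported, Katakana and the spaces are not.
print(find_forbidden("カタカナ と ひらがな", HIRAGANA))

# Nothing is reported for pure Katakana plus space.
print(find_forbidden("アレロ ロメル", ONLY_KATAKANA))

# Latin letters fall outside the allowed set, so each one is reported.
print(find_forbidden("アレロ Lommel", ONLY_KATAKANA))
```

An empty result means the content passes the check; a non-empty one gives a tool the exact characters to flag.
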
>
> So at the very least we need to support the range selectors. Even better
> would be to support limited Unicode regular expressions like this as well:
>
> <span
> its-forbiddenchars="[^\p{Han}\p{Hiragana}\p{Katakana}\p{InCJK_Symbols_and_Punctuation}]">おかがきぎ</span>
>
> I am very aware that allowing arbitrary regular expressions, regardless of
> syntax, is asking for hurt since they can go beyond defining forbidden
> characters to defining forbidden patterns (which would be vastly more
> complex from a support standpoint). For example, if we go beyond allowing
> character enumerations and ranges, we (theoretically at least) have to deal
> with nastiness like this:
>
> <span its-forbiddenchars="(?#Calandar from January 1st 1 A.D to December
> 31, 9999 )(?# in yyyy-mm-dd format
> )(?!(?:1582\D10\D(?:0?[5-9]|1[0-4]))|(?#Missing days from 1582
> )(?:1752\D0?9\D(?:0?[3-9]|1[0-3]))(?#or Missing days from 1752 )(?# both
> sets of missing days should not be in the same calendar so remove one or
> the other))(?n:^(?=\d)(?# the character at the beginning a the string must
> be a digit
> )((?'year'\d{4})(?'sep'[-./])(?'month'0?[1-9]|1[012])\k'sep'(?'day'(?<!(?:0?[469]|11).)31|(?<!0?2.)30|2[0-8]|1\d|0?[1-9]|(?#
> if feb 29th check for valid leap year )(?:(?<=(?!(?#exclude these years
> from leap year pattern ) 000[04](?#No year 0 and no leap year in year 4
> )|(?:(?:1[^0-6]|[2468][^048]|[3579][^26])00)(?# centurial years > 1500 not
> evenly divisible by 400 are not leap year))(?:(?:\d\d)(?#
> century)(?:[02468][048]|[13579][26])(?#leap
> years))\k'sep'(?:0?2)\k'sep')|(?# else if not Feb 29
> )(?<!\k'sep'(?:0?2)\k'sep')(?# and day not Feb 30 or 31
> ))29)(?(?=\x20\d)\x20|$))?(?# if there is a space followed by a digit check
> for time )(?<time>((?# 12 hour format )(0?[1-9]|1[012])(?# hours
> )(:[0-5]\d){0,2}(?# optional minutes and seconds )(?i:\x20[AP]M)(?#
> required AM or PM ))|(?# 24 hour format )([01]\d|2[0-3])(?#hours
> )(:[0-5]\d){1,2})(?#required minutes optional seconds )?$)">Anything but a
> properly formatted date</span>
>
> (I found that nice piece online as an example of full date parsing and
> validation with regex. Unfortunately I can't actually get it to parse in my
> tools; it must use a non-PCRE flavor. It is the craziest piece of regex
> I've come across.)
>
> A MODEST PROPOSAL
> So we do need to constrain this, but we also need to go beyond simple
> enumeration, so how about the following proposal for the attribute value of
> its-forbiddenchars?
>
> • operators: [, ], -, and ^ (i.e., standard range operators) (actually, I
>   suspect that the [ and ] should be optional and implied if not present
>   since without them you are already moving beyond defining ranges to
>   defining patterns: best practice would be to include them to make the
>   semantics clear, but if they are missing, just imply their existence.)
>   o Note that because we are discussing forbidden characters, the ^
>     operator would actually mean to allow only the characters in the
>     range, a reversal of its normal semantics since we are effectively
>     stating a double negation (i.e., NOT NOT these characters)
> • special characters: \r, \n, \t, \f, \c (for control characters), \x and
>   \x{NNNN}, \\, \(, \), \[, \], \{, \}
> • character classes: \s, \S, \w, \W, \d, \D, \p{} (Unicode classes)
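
One way to read the proposal above is as a small grammar that a processor could check before handing the value to a regex engine. The following Python sketch is only an assumption about how that could look: the names `CLASS_BODY` and `is_restricted_class` are hypothetical, quote escapes are accepted in addition to the listed ones, and `-` and `^` are simply accepted as ordinary characters without checking that ranges are well formed.

```python
import re

# One or more tokens from the proposed subset (sketch, not a normative grammar).
CLASS_BODY = re.compile(r"""
    (?:
        \\[rntf]                # \r \n \t \f
      | \\c[A-Za-z]             # control character such as \cM
      | \\x[0-9A-Fa-f]{2}       # \xNN
      | \\x\{[0-9A-Fa-f]+\}     # \x{NNNN}
      | \\[\\()\[\]{}"']        # escaped metacharacters and quote marks
      | \\[sSwWdD]              # shorthand character classes
      | \\p\{[A-Za-z_]+\}       # Unicode property such as \p{Han}
      | [^\\\[\]]               # any other literal, including - and ^
    )+
""", re.VERBOSE)


def is_restricted_class(value: str) -> bool:
    """Check that value is a single character class in the proposed subset."""
    body = value
    # The outer brackets are optional and implied when missing.
    if len(body) >= 2 and body.startswith("[") and body.endswith("]"):
        body = body[1:-1]
    # A leading ^ inverts the set (only the listed characters are allowed).
    if body.startswith("^"):
        body = body[1:]
    return CLASS_BODY.fullmatch(body) is not None


print(is_restricted_class("[^ア-ヿ ]"))
print(is_restricted_class(r"(?:0?[5-9]|1[0-4])"))
```

Because an unescaped [ or ] is never accepted inside the body, anything with nested or consecutive classes is rejected, which keeps values to a single set of characters rather than a pattern.
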
>
> As you may note, just to make transfer of existing character classes
> easier, I would recommend that any characters that would have to be
> backslash-escaped in a normal regex environment be escaped here as well,
> along with any single or double quote marks that would otherwise match the
> attribute value delimiters. E.g., you would need to have something like the
> following to include a quote:
>
> <span its-forbiddenchars="[\"']">Span where no quote marks are
> allowed</span>
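
One caveat on the quote example above: HTML itself does not treat a backslash as an escape inside an attribute value, so an HTML parser ends the value at the second double quote. The Python sketch below illustrates this with the standard library's `html.parser`; the class `AttrReader` and the function `read_attr` are hypothetical helpers, and using character references as the workaround is an assumption about how the value would be serialized, not something the proposal states.

```python
from html import escape
from html.parser import HTMLParser

ATTR = "its-forbiddenchars"


class AttrReader(HTMLParser):
    """Hypothetical helper: remember the first its-forbiddenchars value seen."""

    def __init__(self):
        super().__init__()
        self.value = None

    def handle_starttag(self, tag, attrs):
        found = dict(attrs).get(ATTR)
        if found is not None and self.value is None:
            self.value = found


def read_attr(markup: str):
    reader = AttrReader()
    reader.feed(markup)
    reader.close()
    return reader.value


# The markup as written in the example: backslash before the double quote.
backslash_markup = '<span its-forbiddenchars="[\\"\']">text</span>'

# The same class with the quote marks written as character references.
escaped_value = escape("[\"']", quote=True)
escaped_markup = '<span {}="{}">text</span>'.format(ATTR, escaped_value)

print(repr(read_attr(backslash_markup)))  # value is cut off at the quote
print(escaped_value)
print(repr(read_attr(escaped_markup)))    # full class is recovered
```

In other words, regex-level escaping and markup-level escaping are separate layers, and quote marks in the attribute value need the markup-level form.
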
>
> No idea if this would meet needs or not (I'm sure it would need
> refinement), but I thought I would throw out what it took me about 20
> minutes to cook up.
>
> Hope that helps,
>
> Arle
>
> On Jul 10, 2012, at 18:16 , Michael Kruppa wrote:
>
>
> Hi Felix, all,
>
> from our rather technical point of view, the forbidden characters are
> highly relevant if not to say absolutely necessary in order to avoid
> certain problems. If we cannot agree on an approach based on regular
> expressions due to the inherent complexity, we would definitely opt for a
> solution that would at least allow us to enumerate forbidden characters
> (using Unicode pointers as you suggested).
>
> For us, the regex solution would be of potential interest, but the simple
> enumeration approach would suffice for the current purpose we have in mind.
>
> Best
>
> Micha
>
>
>
>
>


-- 
Felix Sasaki
DFKI / W3C Fellow
Received on Wednesday, 11 July 2012 09:19:56 UTC