
Re: [ACTION-135] specialRequirements flesh out

From: Arle Lommel <arle.lommel@dfki.de>
Date: Wed, 4 Jul 2012 16:07:39 +0200
Cc: "'Giuseppe Deriard [Linguaserve I.S. SA]'" <giuseppe.deriard@linguaserve.com>, "'Dr. David Filip'" <David.Filip@ul.ie>, '"Pedro L. Díez Orzas"' <pedro.diez@linguaserve.com>, <public-multilingualweb-lt@w3.org>
Message-Id: <7FE864F9-ED83-449F-A488-2D3148C84410@dfki.de>
To: Yves Savourel <ysavourel@enlaso.com>
Hi all,

I think Yves is right on this. What would someone do with a reg-ex that looks like this in a list of forbidden characters?

(?<![MAC])(?imsx)([aðő]).+($(qw)\1|d)

As a (PCRE) regex it's total junk for any real use, but it is entirely valid, and completely useless for this kind of purpose. Once we start adding lookahead and lookbehind assertions and so forth, we end up with something very difficult to implement and use.

While I doubt anyone would use something like that, I think a restricted subset, as Yves suggests, makes a lot of sense. If you're at the point where you would use an ugly regex like the one I made, you are probably trying to implement something too complex for this mechanism.

Maybe it makes sense to treat the string as though it were inside an implicit […] and allow metacharacters like \n, \r, \d, etc. Obviously, if you have "." as your string, you are in big trouble, so we should restrict the values to exclude some metacharacters.
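
A minimal sketch of that idea in Python, under assumptions that are mine and not anything agreed in this thread: the forbiddenChar value is compiled as if wrapped in an implicit [...], only the escapes \n, \r, \t and \d are accepted, and every other character is taken literally. The function names are hypothetical.

```python
import re

# Hypothetical whitelist of escapes; the thread only names \n, \r and \d as examples.
ALLOWED_ESCAPES = {"n", "r", "t", "d"}


def forbidden_char_pattern(value):
    """Compile a forbiddenChar value as if it were wrapped in an implicit [...]."""
    if not value:
        raise ValueError("forbiddenChar value must not be empty")
    parts = []
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\":
            # A backslash must introduce one of the whitelisted escapes.
            if i + 1 >= len(value) or value[i + 1] not in ALLOWED_ESCAPES:
                raise ValueError(f"unsupported escape at position {i}")
            parts.append("\\" + value[i + 1])
            i += 2
        else:
            # Everything else is a literal, so "." or "^" lose their special meaning.
            parts.append(re.escape(ch))
            i += 1
    return re.compile("[" + "".join(parts) + "]")


def find_forbidden(value, text):
    """Return every forbidden character found in text, in order of appearance."""
    return forbidden_char_pattern(value).findall(text)
```

With these assumptions, find_forbidden("'<>", "it's <b>") returns ["'", "<", ">"], a value of "." matches only a literal full stop, and anything like \w or a lookbehind is rejected with a ValueError before it reaches the regex engine.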

-Arle

On Jul 4, 2012, at 15:48 , Yves Savourel wrote:

> Hi Giuseppe, all,
>  
> Thanks for the re-work. It looks better to me now.
>  
> I would have one additional suggestion.
>  
> Allowing various regex syntaxes, and indicating which one is used, is both a good and a bad solution. It’s good because it allows flexibility; it’s bad because it’s a big hurdle for interoperability.
>  
> What will be the conformance requirement? a) That a tool supports every regex syntax on a list we provide, or b) that it supports at least one? If we go with a) we end up with a massive task for the tools. If we go with b) we risk a lot of non-interoperable tools.
>  
> So, in this specific case of using a regex to list a set of forbidden characters, I would suggest actually providing a finite list of the regex expressions allowed. Why? Because it’s very likely that only simple patterns are needed in 99% of the cases, and they can be written with expressions that are compatible across many engines. That is: something like [^abc] (anything that is neither ‘a’ nor ‘b’ nor ‘c’) is likely to work everywhere.
>  
> Note that I’m not volunteering to come up with the set of expressions :), but merely suggesting a possible direction that may be both flexible and interoperable.
>  
> Cheers,
> -yves
>  
>  
>  
> From: Giuseppe Deriard [Linguaserve I.S. SA] [mailto:giuseppe.deriard@linguaserve.com] 
> Sent: Wednesday, July 04, 2012 9:40 AM
> To: 'Dr. David Filip'; 'Arle Lommel'
> Cc: Yves Savourel; '"Pedro L. Díez Orzas"'; public-multilingualweb-lt@w3.org; giuseppe.deriard@linguaserve.com
> Subject: RE: [ACTION-135] specialRequirements flesh out
>  
> Hi Yves, hi Arle, hi David,
>  
> Based on your interesting comments, I changed my proposal as follows. I look forward to your feedback.
>  
> specialRequirements
> maxStorageSize
> Declare a field storage limitation, used in combination with the encoding parameter.
> encoding
> Declare the encoding type. For example: UTF-16. 
> maxDisplayLength
> Declare a word length limitation. For example, the text displayed on a display panel with a maximum width of 30 characters.
> forbiddenChar
> Declare a ban on the use of a character, used in combination with the regexType parameter. For example: Do not use the single quote in the translated text, do not use “<” or ”>”
> regexType
> Declare which regex syntax is used for forbiddenChar. For example: ICU
>  
> Implementation examples
> <its:specialRequirements maxStorageSize="200" encoding="UTF-16" maxDisplayLength="30" forbiddenChar="\’" regexType="Java">
> Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
> </its:specialRequirements>
>   
> <span its-specialRequirements="maxStorageSize:200; encoding:UTF-16; maxDisplayLength:30; forbiddenChar:\’; regexType:Java">
> Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
> </span>
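
The HTML form above packs several name:value pairs into a single attribute. A minimal Python sketch of how a consumer might split such a value, assuming pairs are separated by ";" and each name ends at the first ":". The proposal does not define any escaping, so a forbiddenChar value that itself contains ";" would not survive this parser. The function name is hypothetical.

```python
def parse_special_requirements(attr):
    """Split an its-specialRequirements attribute value into a dict.

    Assumptions (not defined in the proposal): pairs are separated by ';',
    and the parameter name ends at the first ':' in each pair.
    """
    result = {}
    for pair in attr.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing or doubled ';'
        name, sep, value = pair.partition(":")
        if not sep:
            raise ValueError(f"missing ':' in {pair!r}")
        result[name.strip()] = value.strip()
    return result
```

The values are returned as strings. Converting maxStorageSize and maxDisplayLength to integers is left to the caller.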
>  
> Please feel free to make any changes you see fit.
>  
> Cheers,
>  
> Giuseppe Deriard
> IT Director
> Linguaserve I.S. S.A.
> Tel.:    +34 91 761 64 60
> Mob.: +34 657 958 677
> www.linguaserve.com
> giuseppe.deriard@linguaserve.com
> es.linkedin.com/in/gderiard
> "According to the provisions set forth in articles 21 and 22 of Law 34/2002 of July 11 regarding Information Society and eCommerce Services, we will store and use your personal data with the sole purpose of marketing the products and services offered by LINGUASERVE INTERNACIONALIZACIÓN DE SERVICIOS, S.A. If you do not wish your personal data to be stored and handled, or you do not wish to receive further information regarding products and services offered by our company, please e-mail us to clients@linguaserve.com. Your request will be processed immediately."
>  
> From: Dr. David Filip [mailto:David.Filip@ul.ie] 
> Sent: Tuesday, July 03, 2012 2:03 PM
> To: Arle Lommel
> Cc: Yves Savourel; "Pedro L. Díez Orzas"; <public-multilingualweb-lt@w3.org>; Giuseppe Deriard [Linguaserve I.S. SA]
> Subject: Re: [ACTION-135] specialRequirements flesh out
>  
> Hi all, I believe that length restrictions are important metadata and, importantly, metadata that should be preserved throughout the localization roundtrip, and therefore throughout the XLIFF roundtrip.
>  
> Fredrik Estreen is currently working on a draft for this, and there is a chance that his solution will make it into core XLIFF 2.0.
>  
> It is more or less in line with Yves's thinking, as posted in this thread. Basically we need to distinguish between display size and storage size. Storage size seems more basic, as it can be easily calculated if you know the encoding, so encoding might be a required attribute here.
> The display size is more complicated, and simply counting code points has limited usability if you come to think of it.
> So the display limitation mechanism (if used at all) should be open to private extensions handling sophisticated display rules, including area size and shape, fonts, etc. (again, this sort of extensibility will be specified in Fredrik's draft).
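
The gap between storage size and a plain code-point count can be illustrated with a short Python sketch. It assumes that the storage limit is measured in bytes of the declared encoding, which this thread does not settle. The function name is hypothetical.

```python
def fits_storage(text, max_storage_size, encoding):
    """Check the encoded length of text against a storage limit.

    Assumption: the limit is expressed in bytes of the given encoding.
    """
    return len(text.encode(encoding)) <= max_storage_size


# "Díez" followed by a space and one emoji outside the Basic Multilingual Plane.
sample = "D\u00edez \U0001F600"

code_points = len(sample)                      # Python strings count code points
utf8_bytes = len(sample.encode("utf-8"))
utf16_bytes = len(sample.encode("utf-16-le"))  # "-le" avoids the 2-byte BOM
```

For this sample the three counts are 6 code points, 10 bytes in UTF-8 and 14 bytes in UTF-16, so the same string passes a 10-byte limit in one encoding and fails it in the other. None of these numbers says anything about rendered width, which depends on fonts and layout.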
>  
> Regarding the banned characters: it seems an unrelated topic, but one worth encoding nevertheless. In many cases we should not prescribe which regexp engine people use; prescribing implementation details is a discouraged standardization practice. Instead, the user should be able to specify which regexp engine they are using. While Perl might seem nice, ICU is more or less the canonical implementation of a Unicode-compliant regexp engine. So I would not exclude either here, and would let people choose what they want to use.
>  
> Rgds
> dF
>  
>  
> 
> Dr. David Filip
> =======================
> LRC | CNGL | LT-Web | CSIS
> University of Limerick, Ireland
> telephone: +353-6120-2781
> cellphone: +353-86-0222-158
> facsimile: +353-6120-2734
> mailto: david.filip@ul.ie
>  
> 
> On Tue, Jul 3, 2012 at 7:33 AM, Arle Lommel <arle.lommel@dfki.de> wrote:
> For what it’s worth, it seems that Perl5 regexes enjoy broad acceptance, and the syntax is more compact and easier to read than POSIX in some cases, so I would favor that one.
> 
> Arle
> 
> --
> Arle Lommel
> Berlin, Germany
> Skype: arle_lommel
> Phone (US): +1 707 709 8650
> 
> Sent from a mobile device. Please excuse any typos.
> 
> On Jul 3, 2012, at 8:24, Yves Savourel <ysavourel@enlaso.com> wrote:
> 
> > Hi Pedro, Giuseppe, all,
> >
> > Thanks for the details for this data category.
> > Here are a few questions/notes:
> >
> > - For 'maxLengthChar' and 'maxLengthCharWord': I assume the unit is a Unicode code point. Is that correct?
> >
> > - My understanding is that 'maxLengthChar' indicates the maximum size the text can have when serialized in its storage and 'maxLengthCharWord' is a maximum display size of sorts. Is that correct? If that is the case, 'maxLengthCharWord' could be renamed to something like 'maxDisplayLength', and 'maxLengthChar' could be something like 'maxFieldSize' or 'maxStorageSize'.
> >
> > - For 'charRestricted': I would suggest the value of this attribute to be a regular expression that matches the forbidden characters. We would have to specify what regular expression 'standard' should be used (POSIX, ICU, Java, Perl5, etc.)
> >
> > - For 'charRestricted': It may also be better to name this attribute something like 'allowedChars' (and reverse the regex value), as 'restricted' is not very clear (it can be read as 'char restricted to' and a list of the only chars allowed.) Or call it 'forbiddenChars'.
> >
> > - While I see the relationship between restrictions of length and content, it seems those could be separate data categories. But I'm not sure it's worth separating them either.
> >
> > Cheers,
> > -yves
> >
> >
> > From: Pedro L. Díez Orzas [mailto:pedro.diez@linguaserve.com]
> > Sent: Friday, June 29, 2012 4:56 PM
> > To: public-multilingualweb-lt@w3.org
> > Cc: Giuseppe Deriard [Linguaserve I.S. SA]
> > Subject: [ACTION-135] specialRequirements flesh out
> >
> > Hi all,
> >
> > Giuseppe sent me this about ACTION 135. Please note that the currently accepted “localizationNote” is human-readable information, while specialRequirements can be used by machines without human intervention. We see this data category as something quite “basic” and consequently necessary. Also, to confirm: we will already provide one implementation of specialRequirements in WP3, so we would need only one more.
> >
> > Here is the specialRequirements flesh-out.
> >
> > maxLengthChar
> > Declare a limitation on the number of characters allowed in the field.
> >
> > maxLengthCharWord
> > Declare a word length limitation. For example, the text displayed on a display panel with a maximum width of 30 characters.
> >
> > charRestricted
> > Declare a ban on use of a character. For example: Do not use the single quote in the translated text, do not use “<” or ”>”
> >
> > <its:specialRequirements maxLengthChar="200" maxLengthCharWord="30" charRestricted="’">
> > Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
> > </its:specialRequirements>
> >
> >
> > <span its-specialRequirements="maxLengthChar:200; maxLengthCharWord:30; charRestricted:’">
> > Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
> > </span>
> >
> > Cheers,
> >
> > Giuseppe Deriard
> > IT Director
> > Linguaserve I.S. S.A.
> > Tel.:    +34 91 761 64 60
> > Mob.: +34 657 958 677
> > www.linguaserve.com
> > giuseppe.deriard@linguaserve.com
> > es.linkedin.com/in/gderiard
> > ________________________________________
> >
> > Best,
> > Pedro
> >
> >
> 
>  


Received on Wednesday, 4 July 2012 14:08:18 UTC
