Re: forbiddenCharacters data category - related to [ACTION-189] from Arle Lommel on 2012-08-27 (public-multilingualweb-lt@w3.org from August 2012)

From: Arle Lommel <arle.lommel@dfki.de>
Date: Mon, 27 Aug 2012 14:55:45 +0200
To: Multilingual Web LT Public List <public-multilingualweb-lt@w3.org>
Message-Id: <9DFD9238-2D6C-4B77-AB5F-72D7ABE35AF0@dfki.de>

> I would propose to avoid the regex completely then, since it seems that then the proposal from Jirka at 
> http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Aug/0280.html
> wouldn't be a solution too.
> 
> We had concerns about the regex before, and Michael said this data category would fulfil his needs without the regex. So let's go forward with that. Otherwise we will create regex that don't work with the content we want them to work on

However, I'm concerned that we might end up with a solution that meets Michael’s need but is ignored by most of the industry. If *I* were implementing this, I would see this category and think "ah, a standard way to restrict characters to a specified range. Great!" But if the only way to, for example, limit a translation to katakana is to dump the entire Unicode repertoire, minus the katakana range, into this attribute, there is *no way* I would touch it.

I know I'm not the implementer here, but my gut feeling is that localization tool developers would look at any mechanism that doesn't use some form of regex and decide to ignore it. There are a lot of tools out there that use regex-based pattern matching for content validation. These would be ideal implementers for what we do. But if we use strict enumeration, they will not touch it, and I'm afraid we would end up creating an irrelevant portion of the standard. (That said, I realize we do not have implementation commitments for anything more complex, so I'm arguing for something in advance of the commitments, which is problematic in its own right.)

-Arle

Received on Monday, 27 August 2012 12:56:04 UTC