RE: forbiddenCharacters data category - related to [ACTION-189] from Yves Savourel on 2012-08-27 (public-multilingualweb-lt@w3.org from August 2012)

From: Yves Savourel <ysavourel@enlaso.com>
Date: Mon, 27 Aug 2012 08:36:31 -0600
To: <public-multilingualweb-lt@w3.org>
Message-ID: <assp.058659313d.assp.05869b0834.000901cd8461$59bc28d0$0d347a70$@com>

Hi Arle, Felix, Jirka, Michael, all,

While Jirka’s solution would help in reducing the number of cases where C0-etc. need to be specified, I think it still doesn’t provide a full solution: we would not be able to allow such characters.

Another aspect to take into account is that—at least from my experience—it’s often much easier to define the list of forbidden characters than the list of allowed characters.

Also, I agree with Arle that the simple enumeration option is very limiting. And while it may work for some use cases, including Michael’s, it will very quickly become unusable in practice.

If we can't find a regex solution, I would argue that we should drop the data category rather than define one with a very limited capability.

If we look again at the issue, it seems to boil down to two problems: a) the inability of the XML regex to handle invalid XML characters, and b) \uHHHH not being supported by all engines.

For a) it's an XML problem that keeps coming back in localization. More and more formats are working around that problem because they need to store those characters in XML. For example Unicode-LDML, TS and XLIFF2 define elements to 'escape' them. A use case like the forbidden characters in Windows paths is a good illustration that we do need to support those characters in the regex.

For b) it's ok to drop the \uHHHH notation in most cases because &#xHHHH; could be used instead.

The only case where it could not be used is for the characters that are invalid in XML. So maybe we could restrict the use of \uHHHH to only those characters. Applications that use an engine not supporting such notation (only a few) would have to map the sequence to literal characters (easy to do). The only case left would be the XML regex engine...
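
To illustrate the difference between the two cases, here is a minimal Python sketch (standard library only). The element name "rule" and the use of forbiddenCharacters as an attribute are assumptions made just for this illustration. It shows an XML 1.0 parser resolving &#x0061; normally, while refusing a reference to a C0 control character:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, for illustration only: "rule" is an invented element name.
ok = ET.fromstring('<rule forbiddenCharacters="[&#x0061;-c]"/>')

# The parser resolves the character reference before the application sees the value.
resolved = ok.get("forbiddenCharacters")

try:
    ET.fromstring('<rule forbiddenCharacters="[&#x0001;-&#x001F;]"/>')
    control_ref_accepted = True
except ET.ParseError:
    # U+0001 is not a valid XML 1.0 character, even as a character reference.
    control_ref_accepted = False
```

In this sketch the first value arrives as [a-c], and the second document is rejected as not well-formed, which is why some other notation is needed for those characters.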

I would say: it's just too bad, but that engine is simply not good enough to fully implement the data category. Why should we restrict the capability of a data category because of the capability of that regex engine? We can provide a note stating so in the specification.

This would give us something like:

----------

The set of characters that are forbidden is specified using a regular expression limited to a simple set pattern supported by most regular expression engines:

- The set is defined between square brackets ('[', and ']').
- One or more operators '-' MAY be used to indicate ranges.
- The prefix '^' MAY be used just after the opening bracket to invert the selection.
- The characters '[', ']', '-', '^' and '\' MUST be prefixed with '\' when used as literals.
- The characters invalid in XML (http://www.w3.org/TR/REC-xml/#charsets), and only them, MUST be expressed using the notation \uHHHH, where HHHH is the hexadecimal value of the Unicode code point of the character.

Note: Applications using regular expression engines that do not support \uHHHH need to map such sequence to literal characters before applying the expression.

Note: Applications using the XML Schema regular expression engine ([[Link needed]]) are not able to support the \uHHHH notation because the engine does not support characters invalid in XML.

Examples (seen as XML source):

• [abc] disallows the characters 'a', 'b' and 'c'.
• [a-c] disallows the characters 'a', 'b' and 'c'.
• [a-cA-C] disallows the characters 'a', 'b', 'c', 'A', 'B', and 'C'.
• [^abc] disallows any characters except 'a', 'b', and 'c'.
• [^a-c] disallows any characters except 'a', 'b', and 'c'.
• [&#x0061;-c] disallows the characters 'a', 'b', and 'c'.
• [\-\[\]] disallows the characters '-', '[' and ']'.
• [\u0000-\u001F&lt;&gt;:"\\/|\?*] disallows the reserved characters for Windows file names.

----------
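
For the first note in the proposed text, the mapping of \uHHHH to literal characters could look roughly like the Python sketch below. The function name expand_unicode_escapes is hypothetical, and the handling of other escapes (leaving them untouched) is my assumption rather than something the proposal spells out:

```python
import re

# Match either a \uHHHH sequence or any other backslash escape.
# Consuming other escapes as a unit keeps "\\u0041" (an escaped backslash
# followed by the text "u0041") from being mistaken for \u0041.
_TOKEN = re.compile(r"\\u([0-9A-Fa-f]{4})|\\.", re.DOTALL)

def expand_unicode_escapes(pattern):
    """Replace each \\uHHHH in a set pattern with the literal character."""
    def repl(match):
        hex_digits = match.group(1)
        if hex_digits is None:
            return match.group(0)  # some other escape such as \\ or \-: keep as is
        ch = chr(int(hex_digits, 16))
        # Defensive: a character that is special inside a set stays escaped.
        return "\\" + ch if ch in "[]-^\\" else ch
    return _TOKEN.sub(repl, pattern)

expanded = expand_unicode_escapes(r"[\u0000-\u001F<>]")
untouched = expand_unicode_escapes(r"[\\u0041]")
```

The result can then be handed to an engine that has no \uHHHH support, as long as that engine accepts the literal characters themselves.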

One last note: I think, in most implementations, it will be a lot easier to implement the regex solution than the simple Unicode code point enumeration: in many cases you just use the value of forbiddenCharacters as is.
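
As a rough illustration of "using the value as is", here is a small Python sketch. It assumes the value has already gone through an XML parser (so &lt; and &gt; are resolved), and the helper name find_forbidden is hypothetical. Python's re module accepts \uHHHH in string patterns, so no mapping step is needed in this case:

```python
import re

# The Windows file name example, as the application would see it after XML
# parsing. The value is passed to the regex engine unchanged.
forbidden = r'[\u0000-\u001F<>:"\\/|\?*]'
checker = re.compile(forbidden)

def find_forbidden(text):
    """Return the forbidden characters found in text, in order of appearance."""
    return checker.findall(text)

clean = find_forbidden("report.txt")
dirty = find_forbidden("a<b>:c?")
```

Here clean is an empty list, while dirty lists the four offending characters. Other engines may of course need the mapping step described in the note of the proposed text.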

Cheers,
-yves

Received on Monday, 27 August 2012 14:37:08 UTC