
Comment on ITS 2.0 specification WD

From: Yves Savourel <ysavourel@enlaso.com>
Date: Thu, 3 Jan 2013 08:25:11 -0700
To: <public-multilingualweb-lt-comments@w3.org>
Message-ID: <assp.07153c3d28.assp.07155d54e4.000301cde9c6$8597d8d0$90c78a70$@com>

I have a comment on the use of XSD regular expressions for Allowed Characters.

This is a topic we have discussed before. I disagreed with the solution but wanted to see the feature move forward and see how others would implement it. So now I want to raise the issue one last time to see if anyone has had a chance to think more about it.

Note that ENLASO will not object if the outcome is to keep the feature as it is. I just want it on record that we think it is not a good solution.

== Summary:

- Using the XML Schema Character Class regular expression syntax is less interoperable than using a small sub-set of common regular expression constructs that most engines support.

- The data carried by the ITS data categories should be tool/platform agnostic whenever possible. Allowed Characters is the only data category that does not meet this goal.

- There are no conformance tests that can validate whether a tool implements the mandated regular expression syntax.

Based on that, I strongly suggest changing the specification so the Allowed Characters data category uses a syntax that is a small common sub-set of what most regular expression engines support.

Note that such a common sub-set would still be compatible and usable with implementations using an XML Schema regular expression engine. The change would also not affect any of the test cases, as those use simple expressions that, I believe, would be covered by the common sub-set. In other words, I don't see the change as substantial: it only restricts the syntax to a sub-set and does not change the capabilities of the feature.

== Rationale:

In most cases, in ITS, the information passed to the consumer of the ITS markup is very basic: it is made of IRIs, lists of identifiers/labels, strings and numbers. They can be dealt with easily in any programming environment.

In one lone case, the Allowed Characters data category, the information passed on to the tool is complex: a regular expression. While it is expressed as a string, a consumer must understand its syntax to be able to use that information.

The current draft (http://www.w3.org/TR/2012/WD-its20-20121206/#allowedchars) states that the syntax to use is that of the Character Classes of XML Schema regular expressions (http://www.w3.org/TR/xmlschema-2/#charcter-classes).

My concern here is that in this lone case ITS foists the use of the XML technology stack upon the consumers of the data category, even though the same goal can be obtained without such a requirement (by consumer I mean the tool using the data, not the tool processing the ITS markup).

If we look at existing localization applications that perform QA-type verification (e.g. XBench, QA Distiller, CheckMate, etc.) they are not especially geared toward using XML dependencies to run their features.
While I understand the drive to use a 'standard' syntax, I think that in the case of Allowed Characters, the exact same goals can be achieved without major added burden by using a small sub-set of regular expression constructs that are common to most regular expression engines.

There is no reason to impose the XSD-specific parts of the syntax on implementers. Doing so is detrimental to interoperability and creates a risk of non-compliant implementations that still work to some degree with the mandated syntax, causing confusion.
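
To illustrate the interoperability concern, here is a minimal sketch in Python of how a QA tool without an XSD regular expression engine might handle an Allowed Characters value. It is not part of the ITS 2.0 draft. The function names and the list of flagged constructs are assumptions made for this example, the list is not exhaustive, and the scan is deliberately naive (it does not account for escaped backslashes).

```python
import re

# Hypothetical, non-exhaustive list of constructs that fall outside a
# conservative common sub-set. The first two are specific to XML Schema
# regular expressions; \p{...} is accepted by several engines but not by
# Python's built-in re module.
NON_PORTABLE = [
    (re.compile(r"-\["), "character class subtraction, e.g. [a-z-[aeiou]]"),
    (re.compile(r"\\[iIcC]"), "XML name escapes \\i \\I \\c \\C"),
    (re.compile(r"\\[pP]\{"), "Unicode property/block escapes, e.g. \\p{IsBasicLatin}"),
]


def non_portable_constructs(expr):
    """Return descriptions of the flagged constructs found in expr (naive scan)."""
    return [desc for pattern, desc in NON_PORTABLE if pattern.search(expr)]


def is_allowed(expr, text):
    """Check that every character of text matches the single-character expr.

    Refuses expressions that use a flagged construct instead of silently
    interpreting them with a different engine's semantics.
    """
    found = non_portable_constructs(expr)
    if found:
        raise ValueError("expression is outside the common sub-set: " + "; ".join(found))
    return all(re.fullmatch(expr, ch) is not None for ch in text)


if __name__ == "__main__":
    print(is_allowed("[a-zA-Z]", "Hello"))
    print(is_allowed("[^*+]", "a+b"))
    print(non_portable_constructs("[a-z-[aeiou]]"))
```

A simple expression such as [a-zA-Z] or [^*+] behaves the same way here as it would in an XSD engine. An expression using character class subtraction is rejected explicitly: passing it unchanged to a non-XSD engine could compile and yet match a different set of characters, which is the kind of partial, confusing behavior described above.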

It is also important to note that the Test Suite is not designed to validate how a tool uses the information carried through the ITS attributes, and therefore there are no conformance tests that can enforce the mandated syntax. The situation would be the same with a small common sub-set, but, in my opinion, it would stand a much better chance of being implemented properly.

Received on Thursday, 3 January 2013 15:25:39 UTC
