RE: I18N-ISSUE-246: Clarify character encoding behavior when calculating storage size [ITS-20]

Hello, 

Coming back to the issue of error handling when storage size is processed: would you agree to adding the following note to the definition of the data category as a resolution?

NOTE: To evaluate a Storage Size constraint, an application must be able to encode the content of the selected nodes in the specified character encoding. An application that evaluates Storage Size but does not support the specified character encoding must report this as an error. If the selected nodes contain characters that the specified character encoding cannot represent, the application must also report this as an error. The application evaluating the Storage Size constraint is not necessarily the ITS processor itself; the constraint may instead be evaluated by applications that consume the ITS-encoded data in later steps. In such cases the above requirements pertain to those ITS-consuming applications.
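
To illustrate the behavior the note describes, here is a minimal sketch in Python. It is not part of the proposed spec text. The names check_storage_size and StorageSizeError are hypothetical, and the UTF-8 default and the byte-count comparison are assumptions of this sketch only.

```python
import codecs


class StorageSizeError(Exception):
    """Hypothetical error type: the constraint could not be evaluated."""


def check_storage_size(text, max_bytes, encoding="utf-8"):
    """Return True if `text` fits into `max_bytes` when encoded in `encoding`.

    Raises StorageSizeError if the encoding is not supported by this
    application, or if `text` contains characters the encoding cannot
    represent (both cases the note says must be reported as errors).
    """
    # Case 1: the application does not support the specified encoding.
    try:
        codecs.lookup(encoding)
    except LookupError as exc:
        raise StorageSizeError(
            f"unsupported character encoding: {encoding}"
        ) from exc

    # Case 2: the content has characters the encoding cannot represent.
    # Strict error handling is used on purpose: no silent replacement.
    try:
        encoded = text.encode(encoding, errors="strict")
    except UnicodeEncodeError as exc:
        raise StorageSizeError(
            f"content cannot be represented in {encoding}: {exc.reason}"
        ) from exc

    # The constraint itself: compare the encoded length in bytes.
    return len(encoded) <= max_bytes
```

With this sketch, the same string can pass or fail depending on the encoding: "größer" needs 8 bytes in UTF-8 but only 6 in ISO-8859-1, while a Japanese string checked against ISO-8859-1 is reported as an error rather than measured.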

Best regards
Stephan

-----Original Message-----
From: Norbert Lindenberg [mailto:w3@norbertlindenberg.com] 
Sent: Thursday, 28 February 2013 08:26
To: Felix Sasaki
Cc: Norbert Lindenberg; Yves Savourel; public-multilingualweb-lt-comments@w3.org; 'www-international'
Subject: Re: I18N-ISSUE-246: Clarify character encoding behavior when calculating storage size [ITS-20]


On Feb 27, 2013, at 14:55 , Felix Sasaki wrote:

> Hi Yves, Norbert, all,
> 
> On 27.02.13 13:50, Yves Savourel wrote:
>> Hi Norbert,
>> 
>>>> Note also that we have no way to check conformance of the 
>>>> applications using the ITS data for such mandatory support: ITS 
>>>> processors just pass the data along, they don't act on them (in the 
>>>> case of this data category).
>>> So who does actually act if a string is too long to fit into the 
>>> specified storage?
>> There are certainly applications that process ITS markup and apply it to the content directly, for example JavaScript in an HTML5 page. But there are also applications that use an ITS processor to feed the content and the ITS information to a distinct system where the information is then applied. They correspond, for example, to the "Localization Workflow Managers" described in the "potential users of ITS"[1].
>> 
>> So I think it's important to make the distinction between the 'ITS processor', which acts on the markup, and the 'consumer of ITS information' (for lack of a better name), which applies the ITS information. Both can be the same application, but they may also be separate ones.
>> 
>> This means a storage-size constraint can be applied completely 
>> outside the original XML/HTML5 document, with tools that have no 
>> relation to the ITS processor itself, or to XML/HTML5 for that 
>> matter. Examples of such applications are localization quality 
>> checking tools (like CheckMate, XBench, QA-Distiller, etc.).
>> 
>> This is why, from my viewpoint, requiring the 'consumer of ITS information' to support UTF-8 is not important. And I was looking at the case for consumers that don't have a need for UTF-8, and whether we should really foist on them such a requirement.
>> 
>> To answer your question "Do you really want to let systems that can represent less than 1% of Unicode advertise themselves as ITS 2.0 conformant?": Why not? If the context where they are utilized is using only 1% of Unicode, why should they be forced to support more? I see many customers that never work outside of Latin-1.
>> 
>> This said, supporting UTF-8 is very easy nowadays and promoting its support is a good thing too. So in the interest of moving forward and of promoting better internationalization, I see no problem requiring the consumer of storage-size to support UTF-8.
>> 
>> The only thing that bothers me a little is that such conformance, as well as the parts about handling errors, applies to the consumer of the ITS information, not really to the ITS processor, and I'm not sure the scope of our tests can cover that.
>> 
>> 
>> With regard to the error handling:
>> 
>>> It could be as simple as "If an ITS processor doesn't support the 
>>> specified character encoding, it must report this as an error and 
>>> terminate processing. If the selected nodes contain characters that 
>>> the specified character encoding cannot represent, the processor 
>>> must report this as an error and terminate processing." Or you could 
>>> try and be nice in the second case and specify a fallback strategy, 
>>> e.g., by saying that the first replacement character among U+FFFD, 
>>> U+003F, U+FF1F that can be represented in the specified character 
>>> encoding must be used instead of any character that can't.
>> I would favor a more practical behavior:
>> 
>> "If the application applying the information doesn't support the specified character encoding, it must report this as an error.
> 
> One important aspect of the above sentence is that, as Yves pointed out, the "must" would be a lower-case "must". That is, this will not be a testable assertion of the ITS 2.0 specification, even if the spec says "the consumer must support UTF-8". In that sense, we might even put that requirement into a note, to make clear that from the ITS 2.0 point of view this is guidance rather than a normative statement. Would that work for you too, Norbert?

While it's much better if assertions can be and are tested, keep in mind that a test suite generally can't prove that a system fully conforms to a spec - it can only show in some cases that it doesn't. And even if this requirement isn't testable by software, there's still the test of looking into the developer's eyes and asking "does your system support UTF-8?". Notes are not requirements, so turning this into a note would remove the basis for asking the question.

Norbert

Received on Tuesday, 19 March 2013 10:41:43 UTC