RE: I18N-ISSUE-246: Clarify character encoding behavior when calculating storage size [ITS-20] from Yves Savourel on 2013-02-27 (www-international@w3.org from January to March 2013)

From: Yves Savourel <ysavourel@enlaso.com>
Date: Wed, 27 Feb 2013 05:50:16 -0700
To: "'Norbert Lindenberg'" <w3@norbertlindenberg.com>
CC: <public-multilingualweb-lt-comments@w3.org>, "'www-international'" <www-international@w3.org>
Message-ID: <007401ce14e8$ff327c70$fd977550$@com>

Hi Norbert,

>> Note also that we have no way to check conformance of the 
>> applications using the ITS data for such mandatory support: ITS processors 
>> just pass the data along, they don't act on them 
>> (in the case of this data category).
>
> So who does actually act if a string is too long to fit into the 
> specified storage?

There is certainly the case of applications that do process ITS markup and apply it to the content directly: for example, JavaScript in an HTML5 page. But there are also applications that use an ITS processor to feed the content and the ITS information to a distinct system where the information is then applied. They correspond, for example, to the "Localization Workflow Managers" described in the "potential users of ITS"[1].

So I think it's important to make the distinction between the 'ITS processor', which acts on the markup, and the 'consumer of ITS information' (for lack of a better name), which applies the ITS information. Both can be the same application, but they may also be separate ones.

This means a storage-size constraint can be applied completely outside the original XML/HTML5 document, with tools that have no relation to the ITS processor itself, or to XML/HTML5 for that matter. Examples of such applications are localization quality checking tools (like CheckMate, XBench, QA-Distiller, etc.).

This is why, from my viewpoint, requiring the 'consumer of ITS information' to support UTF-8 is not important. And I was looking at the case for consumers that don't have a need for UTF-8, and whether we should really foist on them such a requirement.

To answer your question "Do you really want to let systems that can represent less than 1% of Unicode advertise themselves as ITS 2.0 conformant?": Why not? If the context where they are utilized is using only 1% of Unicode, why should they be forced to support more? I see many customers that never work outside of Latin-1.

This said, supporting UTF-8 is very easy nowadays and promoting its support is a good thing too. So in the interest of moving forward and of promoting better internationalization, I see no problem requiring the consumer of storage-size to support UTF-8.

The only thing that bothers me a little is that such conformance, as well as the parts about handling errors, applies to the consumer of the ITS information, not really to the ITS processor, and I'm not sure the scope of our tests can cover that.


With regard to the error handling:

> It could be as simple as "If an ITS processor doesn't support the 
> specified character encoding, it must report this as an error and 
> terminate processing. If the selected nodes contain characters that 
> the specified character encoding cannot represent, the processor must 
> report this as an error and terminate processing." Or you could try 
> and be nice in the second case and specify a fallback strategy, e.g., 
> by saying that the first replacement character among U+FFFD, U+003F,
> U+FF1F that can be represented in the specified character encoding
> must be used instead of any character that can't.

I would favor a more practical behavior:

"If the application applying the information doesn't support the specified character encoding, it must report this as an error. If the content being verified contains characters that the specified character encoding cannot represent, the application must report this as an error."

a) The applications likely to implement the storage-size are checkers and it would make more sense for them to continue checking after reporting a problem.
b) The applications applying the information are not the 'ITS processors' but the 'consumer of ITS information', and at that point I think it's best to talk about content rather than nodes since the data may be completely outside a DOM.
c) I'm not sure falling back to a replacement character is a good thing. If the specified encoding cannot represent the character it's probably better to report it as an error: there is something wrong with either the text or the encoding choice.
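
To illustrate points (a) and (c), here is a minimal Python sketch of how a checker might apply a storage-size limit to plain strings. It is only an illustration under my own assumptions: the names Issue and check_storage_size are hypothetical and do not come from any existing tool, and the choice to return a list of problems rather than stop is just one way to "report an error" and keep checking.

```python
import codecs
from dataclasses import dataclass


@dataclass
class Issue:
    index: int    # position of the string in the input list (-1: whole run)
    message: str


def check_storage_size(strings, max_bytes, encoding="utf-8"):
    """Check each string against a byte limit in the given encoding.

    Problems are collected and returned instead of raised, so one bad
    string does not prevent the remaining ones from being checked.
    """
    try:
        codecs.lookup(encoding)
    except LookupError:
        # Unsupported encoding: nothing can be measured, report it once.
        return [Issue(-1, f"unsupported encoding: {encoding}")]

    issues = []
    for i, text in enumerate(strings):
        try:
            # Strict encoding: no fallback to a replacement character.
            size = len(text.encode(encoding))
        except UnicodeEncodeError as err:
            bad = text[err.start]
            issues.append(Issue(i, f"cannot represent {bad!r} in {encoding}"))
            continue
        if size > max_bytes:
            issues.append(Issue(i, f"{size} bytes exceeds limit of {max_bytes}"))
    return issues
```

Note that the input is a list of strings, not nodes: the content may have left the DOM long before the check runs, as in point (b).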

cheers,
-yves

Received on Wednesday, 27 February 2013 12:50:49 UTC