Re: I18N-ISSUE-246: Clarify character encoding behavior when calculating storage size [ITS-20] from Felix Sasaki on 2013-03-20 (www-international@w3.org from January to March 2013)

From: Felix Sasaki <fsasaki@w3.org>
Date: Wed, 20 Mar 2013 21:17:20 +0100
To: "Phillips, Addison" <addison@lab126.com>
CC: Arle Lommel <arle.lommel@dfki.de>, Stephan Walter <stephan.walter@cocomore.com>, Norbert Lindenberg <w3@norbertlindenberg.com>, Yves Savourel <ysavourel@enlaso.com>, "public-multilingualweb-lt-comments@w3.org" <public-multilingualweb-lt-comments@w3.org>, 'www-international' <www-international@w3.org>
Message-ID: <514A1950.3050903@w3.org>
Hi Addison, all

I think we are discussing in circles a bit, and since no new arguments
come up, the thread is "slowing down". Let me try to summarize the issue:

- the requirement of utf-8 as a default is not something for ITS
processors - it is relevant for non-ITS-aware tools. These tools
potentially don't know whether the "storage size" constraint was
created via ITS or something else.
- we want to give strong guidance about the utf-8 default, but a MUST is
normally for implementers of a spec - and not for applications built
independently of the spec.
- the lower case "must" guidance does not seem acceptable.

So we should probably drop the note in its current form and say
something very general, again in a note, like:

"An application consuming the storage size information is encouraged to 
assume utf-8 as the default encoding. For ITS 2.0, this statement has no 
normative consequences, since such an application potentially does not 
know anything about ITS 2.0 at all but just consumes the storage size 
constraints in a non-ITS 2.0 environment."

Best,

Felix
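
As an illustration of the behaviour discussed in this thread, here is a
minimal Python sketch of how a consuming application might evaluate a
storage size constraint. It assumes utf-8 when no encoding is given, as
the proposed note encourages, and reports an error when the encoding is
unsupported or cannot represent the content, as suggested earlier in the
thread. The names fits_storage_size, StorageSizeError and
DEFAULT_ENCODING are hypothetical and come neither from ITS 2.0 nor from
any existing tool.

```python
import codecs

# Assumption: utf-8 is used when the constraint names no encoding.
DEFAULT_ENCODING = "utf-8"


class StorageSizeError(Exception):
    """Hypothetical error: the constraint cannot be evaluated at all."""


def fits_storage_size(text, max_bytes, encoding=DEFAULT_ENCODING):
    """Return True if text, encoded in encoding, needs at most max_bytes bytes."""
    # Case 1: the application does not support the specified encoding.
    try:
        codecs.lookup(encoding)
    except LookupError as exc:
        raise StorageSizeError(
            f"unsupported character encoding: {encoding}"
        ) from exc

    # Case 2: the content contains characters the encoding cannot represent.
    try:
        encoded = text.encode(encoding)
    except UnicodeEncodeError as exc:
        raise StorageSizeError(
            f"content cannot be represented in {encoding}"
        ) from exc

    # The constraint is expressed in bytes, not in characters.
    return len(encoded) <= max_bytes
```

With this sketch, the four-character string "Füße" needs 6 bytes in utf-8
but only 4 bytes in iso-8859-1, so the same limit can pass or fail
depending on the encoding that is assumed.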

On 19.03.13 16:47, Phillips, Addison wrote:
> It's not a new normative statement if normative language isn't intended. I would avoid putting a normative statement into a "note". Generally, when I'm spec writing, I avoid the Magic Normative Words unless I mean them normatively. So in this case I read the proposed note text as meaning:
>
>> In order to be able to evaluate a Storage Size constraint an application
>> has to be able to encode...
> Which is an example of "anti-normative" writing:
>
> MAY -> can
> SHOULD -> ought
> SHOULD NOT -> ought not, avoid
> MUST -> has to
> MUST NOT -> can't, don't
> RECOMMENDED -> really good idea
>
> Addison
>
>> -----Original Message-----
>> From: Arle Lommel [mailto:arle.lommel@dfki.de]
>> Sent: Tuesday, March 19, 2013 8:22 AM
>> To: Stephan Walter
>> Cc: Norbert Lindenberg; Felix Sasaki; Yves Savourel;
>> public-multilingualweb-lt-comments@w3.org; 'www-international'
>> Subject: Re: I18N-ISSUE-246: Clarify character encoding behavior when
>> calculating storage size [ITS-20]
>>
>> Stephan,
>>
>> I think this sounds good. However, as it adds a MUST statement, will it impact us
>> because it could be seen as a new normative statement? (I think it is rather a
>> clarification of intent, but just want to check on it.)
>>
>> -Arle
>>
>> On 2013 Mar 19, at 06:41 , Stephan Walter <stephan.walter@cocomore.com>
>> wrote:
>>
>>> Hello,
>>>
>>> coming back to the issue of error handling when storage size is processed.
>>> Would you agree to adding the following note to the definition of the data
>>> category as a resolution:
>>> NOTE: In order to be able to evaluate a Storage Size constraint an application
>>> must be able to encode the content of the selected nodes in the specified
>>> character encoding. An application that evaluates Storage Size but does not
>>> support the specified character encoding must report this as an error. If the
>>> selected nodes contain characters that the specified character encoding cannot
>>> represent, the processor must also report this as an error. The application
>>> evaluating the Storage Size constraint is not necessarily the ITS processor itself.
>>> The constraint may rather be evaluated by applications consuming the ITS
>>> encoded data in later steps. In such cases the above requirement pertains to
>>> those ITS consuming applications.
>>> Best regards
>>> Stephan
>>>
>>> -----Original Message-----
>>> From: Norbert Lindenberg [mailto:w3@norbertlindenberg.com]
>>> Sent: Thursday, February 28, 2013 08:26
>>> To: Felix Sasaki
>>> Cc: Norbert Lindenberg; Yves Savourel;
>>> public-multilingualweb-lt-comments@w3.org; 'www-international'
>>> Subject: Re: I18N-ISSUE-246: Clarify character encoding behavior when
>>> calculating storage size [ITS-20]
>>>
>>>
>>> On Feb 27, 2013, at 14:55 , Felix Sasaki wrote:
>>>
>>>> Hi Yves, Norbert, all,
>>>>
>>>> On 27.02.13 13:50, Yves Savourel wrote:
>>>>> Hi Norbert,
>>>>>
>>>>>>> Note also that we have no way to check conformance of the
>>>>>>> applications using the ITS data for such mandatory support: ITS
>>>>>>> processors just pass the data along, they don't act on them (in
>>>>>>> the case of this data category).
>>>>>> So who does actually act if a string is too long to fit into the
>>>>>> specified storage?
>>>>> There is certainly the case of applications that do process ITS markup and
>>>>> apply it to the content directly: For example a JavaScript in an HTML5 page. But
>>>>> there are also applications that use an ITS processor to feed the content and
>>>>> the ITS information to a distinct system where the information is then applied.
>>>>> They correspond, for example, to the "Localization Workflow Managers"
>>>>> described in the "potential users of ITS"[1].
>>>>> So I think it's important to make the distinction between the 'ITS processor'
>>>>> which acts on the markup, and the 'consumer of ITS information' (for lack of a
>>>>> better name) that applies the ITS information. Both can be the same
>>>>> application, but they may also be separate ones.
>>>>> This means a storage-size constraint can be applied completely
>>>>> outside the original XML/HTML5 document with tools that have no
>>>>> relation to the ITS processor itself, or to XML/HTML5 for that
>>>>> matter. Examples of such applications are localization quality
>>>>> checking tools (like CheckMate, XBench, QA-Distiller, etc.)
>>>>>
>>>>> This is why, from my viewpoint, requiring the 'consumer of ITS information'
>>>>> to support UTF-8 is not important. And I was looking at the case for consumers
>>>>> that don't have a need for UTF-8, and whether we should really foist on them
>>>>> such a requirement.
>>>>>
>>>>> To answer your question "Do you really want to let systems that can
>>>>> represent less than 1% of Unicode advertise themselves as ITS 2.0
>>>>> conformant?": Why not? If the context where they are utilized is using only 1%
>>>>> of Unicode, why should they be forced to support more? I see many customers
>>>>> that never work outside of Latin-1.
>>>>> This said, supporting UTF-8 is very easy nowadays and promoting its
>>>>> support is a good thing too. So in the interest of moving forward and of
>>>>> promoting better internationalization, I see no problem requiring the consumer
>>>>> of storage-size to support UTF-8.
>>>>> The only thing that bothers me a little is that such conformance as well as
>>>>> the parts about handling errors, apply to the consumer of the ITS information,
>>>>> not really the ITS processor, and I'm not sure the scope of our tests can cover
>>>>> that.
>>>>>
>>>>> With regards to the error handling:
>>>>>
>>>>>> It could be as simple as "If an ITS processor doesn't support the
>>>>>> specified character encoding, it must report this as an error and
>>>>>> terminate processing. If the selected nodes contain characters that
>>>>>> the specified character encoding cannot represent, the processor
>>>>>> must report this as an error and terminate processing." Or you
>>>>>> could try and be nice in the second case and specify a fallback
>>>>>> strategy, e.g., by saying that the first replacement character
>>>>>> among U+FFFD,
>>>>>> U+003F,
>>>>>> U+FF1F that can be represented in the specified character encoding
>>>>>> must be used instead of any character that can't.
>>>>> I would favor a more practical behavior:
>>>>>
>>>>> "If the application applying the information doesn't support the specified
>>>>> character encoding, it must report this as an error.
>>>> One important aspect of the above sentence is that - as Yves pointed out - the
>>>> "must" would be a lower case "must". That is, this will be no testable
>>>> assertion of the ITS 2.0 specification, even if the spec says "the consumer
>>>> must support UTF-8". In that sense, we might even put that requirement into a
>>>> note, to make clear that from the ITS 2.0 point of view this is rather guidance
>>>> than a normative statement. Would that work for you too, Norbert?
>>> While it's much better if assertions can be and are tested, keep in mind that a
>>> test suite generally can't prove that a system fully conforms to a spec - it can
>>> only show in some cases that it doesn't. And even if this requirement isn't
>>> testable by software, there's still the test of looking into the developer's eyes
>>> and asking "does your system support UTF-8?". Notes are not requirements, so
>>> turning this into a note would remove the basis for asking the question.
>>> Norbert
>>>
>>>
>>>
Received on Wednesday, 20 March 2013 20:17:51 UTC