Re: I18N-ISSUE-246: Clarify character encoding behavior when calculating storage size [ITS-20]

Hi all,

I have been tasked with moving this issue forward, as it seems to have
stalled a little. I propose to resolve and close it by adding the
following note at the end of the Storage Size definition section
(http://www.w3.org/International/multilingualweb/lt/drafts/its20/its20.html#storagesize-definition):

[[
NOTE: In order to be able to evaluate a Storage Size
constraint an application encodes the content of
the selected nodes in the specified character encoding. An
application that evaluates Storage Size but does not support the
specified character encoding reports this as an error. If the
selected nodes contain characters that the specified character
encoding cannot represent, the processor reports this as
an error.

The application evaluating the Storage Size constraint
is not necessarily the ITS processor itself. The constraint may
rather be evaluated by applications consuming the ITS encoded
data in later steps. In such cases the above behaviour pertains
to those ITS consuming applications.
]]
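
To make the intended behaviour concrete, here is a rough Python sketch of
how a consuming application might evaluate the constraint. It is only an
illustration under my own assumptions: the names fits_storage_size and
StorageSizeError are hypothetical and are not defined by ITS 2.0, and how
an error is actually reported is up to the application.

```python
import codecs


class StorageSizeError(Exception):
    """Hypothetical error type; ITS 2.0 does not define one."""


def fits_storage_size(text, max_bytes, encoding="UTF-8"):
    """Return True if text, encoded in `encoding`, needs at most max_bytes bytes."""
    try:
        # Fails with LookupError if this application does not know the encoding.
        codecs.lookup(encoding)
        # Fails with UnicodeEncodeError if a character cannot be represented.
        encoded = text.encode(encoding)
    except LookupError as exc:
        raise StorageSizeError(f"unsupported character encoding: {encoding}") from exc
    except UnicodeEncodeError as exc:
        raise StorageSizeError(
            f"content cannot be represented in {encoding}: {exc.reason}"
        ) from exc
    # The size is counted in bytes of the encoded content.
    return len(encoded) <= max_bytes
```

For example, fits_storage_size("Papà", 4, "ISO-8859-1") is true, while the
same call with the default UTF-8 is false because "à" takes two bytes there.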

The rationale for this change, which hopefully reflects all the concerns raised:

-- UTF-8 is the default value of the storageEncoding attribute, so
there is no need to repeat this in a note.

-- It has been pointed out that the storage size mechanism is targeted at
generating output resources that are loaded into various components and
devices. Many such devices are legacy and do not support UTF-8, so
requiring UTF-8 support would not reflect reality and would be an
artificial requirement.

-- As storage size constraints are not necessarily evaluated by an ITS
processor, it doesn't make sense to use normative language. An informative
note is thus used to provide guidance for implementers, and no normative
language is used inside the note.

-- I don't think that we should use code units for counting the size of
an encoded string. Such a unit depends on the encoding used and is not
directly supported by common programming languages and resource formats.
Memory is still counted in bytes/octets.
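
As a small illustration of the last point, this Python sketch compares the
two ways of counting for the same string. The helper names are hypothetical,
and the code unit width is passed in by hand because it has to be known
separately for each encoding.

```python
def size_in_bytes(text, encoding):
    """Number of bytes/octets the text occupies in the given encoding."""
    return len(text.encode(encoding))


def size_in_code_units(text, encoding, unit_bytes):
    """Number of code units, given the code unit width of the encoding in bytes."""
    return size_in_bytes(text, encoding) // unit_bytes


sample = "Papà"
print(size_in_bytes(sample, "utf-8"))               # 5
print(size_in_bytes(sample, "utf-16-le"))           # 8
print(size_in_code_units(sample, "utf-16-le", 2))   # 4
```

The byte count answers directly how much storage the string needs, whereas
the code unit count only becomes meaningful once the unit width is known.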

If you can't live with this resolution, please shout and respond by
April 2nd. The ITS WG would like to resolve and close this issue during
the April 3rd teleconference.

Thanks and have a nice day,

    Jirka



> I think we are discussing in circles a bit, and since no new arguments
> come up the thread is "slowing down". Let me try to summarize the issue:
> 
> - the requirement of utf-8 as a default is not something for ITS
> processors - it is relevant for ITS non aware tools. The tools
> potentially don't know whether the "storage size" constraints was
> created via ITS or something else.
> - we want to give strong guidance about the utf-8 default, but a MUST is
> normally for implementers of a spec - and not for applications built
> independent of the spec.
> - the lower case must guidance does not seem acceptable.
> 
> so we should probably drop the note in the current form and say
> something very general, again in a note, like:
> 
> "An application consuming the storage size information is encouraged to
> assume utf-8 as the default encoding. For ITS 2.0, this statement has no
> normative consequences, since such an application potentially does not
> know anything about ITS 2.0 at all but just consumes the storage size
> constraints in a non ITS 2.0 environment"
> 
> Best,
> 
> Felix
> 
> On 19.03.13 16:47, Phillips, Addison wrote:
>> It's not a new normative statement if normative language isn't
>> intended. I would avoid putting a normative statement into a "note".
>> Generally, when I'm spec writing, I avoid the Magic Normative Words
>> unless I mean them normatively. So in this case I read the proposed
>> note text as meaning:
>>
>>> In order to be able to evaluate a Storage Size constraint an application
>>> has to be able to encode...
>> Which is an example of "anti-normative" writing:
>>
>> MAY -> can
>> SHOULD -> ought
>> SHOULD NOT -> ought not, avoid
>> MUST -> has to
>> MUST NOT -> can't, don't
>> RECOMMENDED -> really good idea
>>
>> Addison
>>
>>> -----Original Message-----
>>> From: Arle Lommel [mailto:arle.lommel@dfki.de]
>>> Sent: Tuesday, March 19, 2013 8:22 AM
>>> To: Stephan Walter
>>> Cc: Norbert Lindenberg; Felix Sasaki; Yves Savourel;
>>> public-multilingualweb-lt-
>>> comments@w3.org; 'www-international'
>>> Subject: Re: I18N-ISSUE-246: Clarify character encoding behavior when
>>> calculating storage size [ITS-20]
>>>
>>> Stephan,
>>>
>>> I think this sound good. However, as it adds a MUST statement, will
>>> it impact us
>>> because it could be seen as a new normative statement? (I think it is
>>> rather a
>>> clarification of intent, but just want to check on it.)
>>>
>>> -Arle
>>>
>>> On 2013 Mar 19, at 06:41 , Stephan Walter <stephan.walter@cocomore.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> coming back to  the issue of error handling when storage size is
>>>> processed.
>>> Would you agree to adding the following note to the definition of the
>>> data
>>> category as a resolution:
>>>> NOTE:  In order to be able to evaluate a Storage Size constraint an
>>>> application
>>> must be able to encode the content of the selected nodes in the
>>> specified
>>> character encoding. An application that evaluates Storage Size but
>>> does not
>>> support the specified character encoding must report this as an
>>> error. If the
>>> selected nodes contain characters that the specified character
>>> encoding cannot
>>> represent, the processor must also report this as an error. The
>>> application
>>> evaluating the Storage Size constraint is not necessarily the ITS
>>> processor itself.
>>> The constraint may rather be evaluated by applications consuming the ITS
>>> encoded data in later steps. In such cases the above requirement
>>> pertains to
>>> those ITS consuming applications.
>>>> Best regards
>>>> Stephan
>>>>
>>>> -----Original Message-----
>>>> From: Norbert Lindenberg [mailto:w3@norbertlindenberg.com]
>>>> Sent: Thursday, 28 February 2013 08:26
>>>> To: Felix Sasaki
>>>> Cc: Norbert Lindenberg; Yves Savourel; public-multilingualweb-lt-
>>> comments@w3.org; 'www-international'
>>>> Subject: Re: I18N-ISSUE-246: Clarify character encoding behavior when
>>>> calculating storage size [ITS-20]
>>>>
>>>>
>>>> On Feb 27, 2013, at 14:55 , Felix Sasaki wrote:
>>>>
>>>>> Hi Yves, Norbert, all,
>>>>>
>>>>> On 27.02.13 13:50, Yves Savourel wrote:
>>>>>> Hi Norbert,
>>>>>>
>>>>>>>> Note also that we have no way to check conformance of the
>>>>>>>> applications using the ITS data for such mandatory support: ITS
>>>>>>>> processors just pass the data along, they don't act on them (in
>>>>>>>> the case of this data category).
>>>>>>> So who does actually act if a string is too long to fit into the
>>>>>>> specified storage?
>>>>>> There is certainly the case of applications that do process ITS
>>>>>> markup and
>>> apply it to the content directly: For example a JavaScript in an
>>> HTML5 page. But
>>> there are also applications that use an ITS processor to feed the
>>> content and
>>> the ITS information to a distinct system where the information is
>>> then applied.
>>> They correspond, for example, to the "Localization Workflow Managers"
>>> described in the "potential users of ITS"[1].
>>>>>> So I think it's important to make the distinction between the 'ITS
>>>>>> processor'
>>> which act on the markup, and the 'consumer of ITS information' (for
>>> lack of a
>>> better name) that applies the ITS information. Both can be the same
>>> application, but they may also be separate ones.
>>>>>> This means a storage-size constraint can be applied completely
>>>>>> outside the original XML/HTML5 document with tools that have no
>>>>>> relations with the ITS processor itself, or with XML/HTML5 for that
>>>>>> matter. Examples of such applications are localization quality
>>>>>> checking tools (like CheckMate, XBench, QA-Distiller, etc.)
>>>>>>
>>>>>> This is why, from my viewpoint, requiring the 'consumer of ITS
>>>>>> information'
>>> to support UTF-8 is not important. And I was looking at the case for
>>> consumers
>>> that don't have a need for UTF-8, and whether we should really foist
>>> on them
>>> such a requirement.
>>>
>>>>>> To answer your question "Do you really want to let systems that can
>>> represent less than 1% of Unicode advertise themselves as ITS 2.0
>>> conformant?": Why not? If the context where they are utilized is
>>> using only 1%
>>> of Unicode, why should they be forced to support more? I see many
>>> customers
>>> that never work outside of Latin-1.
>>>>>> This said, supporting UTF-8 is very easy nowadays and promoting its
>>> support is a good thing too. So in the interest of moving forward and of
>>> promoting better internationalization, I see no problem requiring the
>>> consumer
>>> of storage-size to support UTF-8.
>>>>>> The only thing that bother me a little is that such conformance as
>>>>>> well as
>>> the parts about handling errors, apply to the consumer of the ITS
>>> information,
>>> not really the ITS processor, and I'm not sure the scope of our tests
>>> can cover
>>> that.
>>>>>>
>>>>>> With regards to the error handling:
>>>>>>
>>>>>>> It could be as simple as "If an ITS processor doesn't support the
>>>>>>> specified character encoding, it must report this as an error and
>>>>>>> terminate processing. If the selected nodes contain characters that
>>>>>>> the specified character encoding cannot represent, the processor
>>>>>>> must report this as an error and terminate processing." Or you
>>>>>>> could try and be nice in the second case and specify a fallback
>>>>>>> strategy, e.g., by saying that the first replacement character
>>>>>>> among U+FFFD,
>>>>>>> U+003F,
>>>>>>> U+FF1F that can be represented in the specified character encoding
>>>>>>> must be used instead of any character that can't.
>>>>>> I would favor a more practical behavior:
>>>>>>
>>>>>> "If the application applying the information doesn't support the
>>>>>> specified
>>> character encoding, it must report this as an error.
>>>>> One important aspect of above sentence is that - as Yves pointed
>>>>> out - the
>>> "must" would be a lower case "must". That is, this will be no testable
>>> assertation of the ITS 2.0 specification, even if the spec says "the
>>> consumer
>>> must support UTF-8". In that sense, we might even put that
>>> requirement into a
>>> note, to make clear that from the ITS 2.0 point of view this is
>>> rather guidance
>>> than a normative statement. Would that work for you too, Norbert?
>>>> While it's much better if assertions can be and are tested, keep in
>>>> mind that a
>>> test suite generally can't prove that a system fully conforms to a
>>> spec - it can
>>> only show in some cases that it doesn't. And even if this requirement
>>> isn't
>>> testable by software, there's still the test of looking into the
>>> developer's eyes
>>> and asking "does your system support UTF-8?". Notes are not
>>> requirements, so
>>> turning this into a note would remove the basis for asking the question.
>>>> Norbert
>>>>
>>>>
>>>>
> 
> 

-- 
------------------------------------------------------------------
  Jirka Kosek      e-mail: jirka@kosek.cz      http://xmlguru.cz
------------------------------------------------------------------
       Professional XML consulting and training services
  DocBook customization, custom XSLT/XSL-FO document processing
------------------------------------------------------------------
 OASIS DocBook TC member, W3C Invited Expert, ISO JTC1/SC34 rep.
------------------------------------------------------------------
    Bringing you XML Prague conference    http://xmlprague.cz
------------------------------------------------------------------

Received on Friday, 29 March 2013 10:32:47 UTC