RE: I18N-ISSUE-246: Clarify character encoding behavior when calculating storage size [ITS-20] from Phillips, Addison on 2013-03-20 (www-international@w3.org from January to March 2013)

From: Phillips, Addison <addison@lab126.com>
Date: Wed, 20 Mar 2013 23:01:58 +0000
To: Felix Sasaki <fsasaki@w3.org>
CC: Arle Lommel <arle.lommel@dfki.de>, Stephan Walter <stephan.walter@cocomore.com>, Norbert Lindenberg <w3@norbertlindenberg.com>, Yves Savourel <ysavourel@enlaso.com>, "public-multilingualweb-lt-comments@w3.org" <public-multilingualweb-lt-comments@w3.org>, "'www-international'" <www-international@w3.org>
Message-ID: <7C0AF84C6D560544A17DDDEB68A9DFB5043FD2@ex10-mbx-31004.ant.amazon.com>
Hi Felix, 

I've tried to stay out of the "storage size" fray. I was making an observation about avoiding normative language, which helps avoid discussions of whether "must" is normative.

However... regarding the "storage size" thing... the point of the feature is to allow the users of ITS to communicate size limits on a text value/token. As it happens, I was just editing the Unicode FAQ on counting things [1] and certainly the considerations presented there would seem to apply here. Byte counts are not the only use case for counting localizable fields.

When it comes to the current usage of "storageSize", there are several points I would make:

You should allow for a Unicode code point count as the default. For UTF-8 this is more useful in many cases.

If a character encoding is specified, then the count is in that character encoding's code units (which might be bytes, but might not be). That is, if "storageEncoding=UTF-16" then storageSize=12 means 12 16-bit code units (24 bytes), not 12 bytes (6 code units). This also keeps silly things like "11 bytes of UTF-16" from happening.
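To make the difference between these counts concrete, here is a minimal sketch in Python. The function name count_units and the rule "no encoding means code points" are assumptions for illustration, not anything ITS defines; the encoding names are those of Python's codec registry rather than the IANA registry.

```python
def count_units(text, encoding=None):
    """Count storage units for text (hypothetical helper, not part of ITS).

    With no encoding, count Unicode code points. Otherwise count the code
    units of the named encoding: 16-bit units for UTF-16, 32-bit units for
    UTF-32, bytes for everything else.
    """
    if encoding is None:
        return len(text)  # a Python str is a sequence of code points
    name = encoding.lower().replace("-", "").replace("_", "")
    if name in ("utf16", "utf16le", "utf16be"):
        # Encode without a BOM, then divide the byte length by the unit width.
        return len(text.encode("utf-16-le")) // 2
    if name in ("utf32", "utf32le", "utf32be"):
        return len(text.encode("utf-32-le")) // 4
    return len(text.encode(encoding))  # byte-oriented encodings


sample = "\u00e9\U0001D11E"  # U+00E9 followed by U+1D11E (outside the BMP)
print(count_units(sample))            # code points
print(count_units(sample, "UTF-8"))   # bytes
print(count_units(sample, "UTF-16"))  # 16-bit code units
```

For this two-character sample the three counts differ (2 code points, 6 UTF-8 bytes, 3 UTF-16 code units), which is why the unit has to be stated along with the number.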

This conversation has revolved, though, around which character encodings should be mandated by ITS. I think Norbert's original comment has some merit: you should consider what encodings you *mandate* are supported (it could be "none"). The current text sets the default to UTF-8, but it isn't clear what this means. I assume it means that, if the storageEncoding is omitted, then the count is in bytes when the content is converted to UTF-8? I would actually tend to make it Unicode code points.

In any case, I would not write the statement you have. "Encouraging" an application to use UTF-8 is no guidance at all. Either the default is UTF-8 or it isn't. If it isn't, then you should probably define what it means when the encoding is omitted. 

If, by contrast, you mean that the storageSize value is an arbitrary numeric value whose interpretation is implementation or user defined, you should probably say exactly that. That is, changing this text:

--
A storageSize attribute. It contains the maximum number of bytes the text of the selected node is allowed in storage.
--

To read more like:

--
A storageSize attribute. It contains the maximum number of units the text of the selected node is allowed in storage. Units are generally code units in a given character encoding. Interpretation is implementation defined.
--

And then later where it says:

--
A storageEncoding attribute. It contains the name of the character set encoding used to calculate the number of bytes of the selected text. The name MUST be one of the names or aliases listed in the IANA Character Sets registry [IANA Character Sets]. The default value is "UTF-8".
--

Something more like:

--
A storageEncoding attribute. It contains the name of the character encoding form used to calculate the number of code units of the selected text (cf. CharMod). The name MUST be one of the names or aliases listed in the IANA Character Sets registry [IANA Character Sets]. If omitted, the character encoding is undefined and the storageSize attribute's interpretation is implementation defined. Interpretation of the storageEncoding attribute is implementation defined.
--
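As a rough illustration of how a consumer might read the two attributes under the wording above, here is a sketch in Python. The function name fits_storage, the unit-width table, and the choice to count code points when storageEncoding is omitted are all assumptions (the last is just one possible implementation-defined behavior). The name lookup uses Python's codec registry, which only approximates the IANA registry.

```python
import codecs

# Encoding forms whose code unit is wider than one byte, mapped to a
# BOM-less codec and the unit width in bytes. Illustrative, not from ITS.
_WIDE_FORMS = {
    "utf-16": ("utf-16-le", 2),
    "utf-16-le": ("utf-16-le", 2),
    "utf-16-be": ("utf-16-be", 2),
    "utf-32": ("utf-32-le", 4),
    "utf-32-le": ("utf-32-le", 4),
    "utf-32-be": ("utf-32-be", 4),
}


def fits_storage(text, storage_size, storage_encoding=None):
    """Return True if text fits within storage_size units (hypothetical)."""
    if storage_encoding is None:
        # Implementation-defined case; this sketch counts code points.
        return len(text) <= storage_size
    # codecs.lookup raises LookupError for names Python does not know.
    canonical = codecs.lookup(storage_encoding).name
    codec, width = _WIDE_FORMS.get(canonical, (canonical, 1))
    # encode raises UnicodeEncodeError for unrepresentable characters.
    units = len(text.encode(codec)) // width
    return units <= storage_size


# storageSize=12 with storageEncoding=UTF-16 allows 12 16-bit code units.
print(fits_storage("abcdefghijkl", 12, "UTF-16"))
print(fits_storage("abcdefghijklm", 12, "UTF-16"))
```

An unsupported encoding name or an unrepresentable character surfaces here as a Python exception; what a real consumer does in those cases is left open by the proposed text.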

Addison

Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N WG)

Internationalization is not a feature.
It is an architecture.

[1] http://www.unicode.org/faq/char_combmark.html#7 

> -----Original Message-----
> From: Felix Sasaki [mailto:fsasaki@w3.org]
> Sent: Wednesday, March 20, 2013 1:17 PM
> To: Phillips, Addison
> Cc: Arle Lommel; Stephan Walter; Norbert Lindenberg; Yves Savourel; public-
> multilingualweb-lt-comments@w3.org; 'www-international'
> Subject: Re: I18N-ISSUE-246: Clarify character encoding behavior when
> calculating storage size [ITS-20]
> 
> Hi Addison, all
> 
> I think we are discussing in circles a bit, and since no new arguments come up
> the thread is "slowing down". Let me try to summarize the issue:
> 
> - the requirement of utf-8 as a default is not something for ITS processors - it is
> relevant for non-ITS-aware tools. The tools potentially don't know whether the
> "storage size" constraint was created via ITS or something else.
> - we want to give strong guidance about the utf-8 default, but a MUST is
> normally for implementers of a spec - and not for applications built
> independent of the spec.
> - the lower case must guidance does not seem acceptable.
> 
> so we should probably drop the note in the current form and say something
> very general, again in a note, like:
> 
> "An application consuming the storage size information is encouraged to
> assume utf-8 as the default encoding. For ITS 2.0, this statement has no
> normative consequences, since such an application potentially does not know
> anything about ITS 2.0 at all but just consumes the storage size constraints in a
> non ITS 2.0 environment"
> 
> Best,
> 
> Felix
> 
> On 19.03.13 16:47, Phillips, Addison wrote:
> > It's not a new normative statement if normative language isn't intended. I
> would avoid putting a normative statement into a "note". Generally, when I'm
> spec writing, I avoid the Magic Normative Words unless I mean them
> normatively. So in this case I read the proposed note text as meaning:
> >
> >> In order to be able to evaluate a Storage Size constraint an
> >> application has to be able to encode...
> > Which is an example of "anti-normative" writing:
> >
> > MAY -> can
> > SHOULD -> ought
> > SHOULD NOT -> ought not, avoid
> > MUST -> has to
> > MUST NOT -> can't, don't
> > RECOMMENDED -> really good idea
> >
> > Addison
> >
> >> -----Original Message-----
> >> From: Arle Lommel [mailto:arle.lommel@dfki.de]
> >> Sent: Tuesday, March 19, 2013 8:22 AM
> >> To: Stephan Walter
> >> Cc: Norbert Lindenberg; Felix Sasaki; Yves Savourel;
> >> public-multilingualweb-lt- comments@w3.org; 'www-international'
> >> Subject: Re: I18N-ISSUE-246: Clarify character encoding behavior when
> >> calculating storage size [ITS-20]
> >>
> >> Stephan,
> >>
> >> I think this sounds good. However, as it adds a MUST statement, will
> >> it impact us because it could be seen as a new normative statement?
> >> (I think it is rather a clarification of intent, but just want to
> >> check on it.)
> >>
> >> -Arle
> >>
> >> On 2013 Mar 19, at 06:41 , Stephan Walter
> >> <stephan.walter@cocomore.com>
> >> wrote:
> >>
> >>> Hello,
> >>>
> >>> coming back to the issue of error handling when storage size is processed.
> >> Would you agree to adding the following note to the definition of the
> >> data category as a resolution:
> >>> NOTE:  In order to be able to evaluate a Storage Size constraint an
> >>> application
> >> must be able to encode the content of the selected nodes in the
> >> specified character encoding. An application that evaluates Storage
> >> Size but does not support the specified character encoding must
> >> report this as an error. If the selected nodes contain characters
> >> that the specified character encoding cannot represent, the processor
> >> must also report this as an error. The application evaluating the Storage Size
> constraint is not necessarily the ITS processor itself.
> >> The constraint may rather be evaluated by applications consuming the
> >> ITS encoded data in later steps. In such cases the above requirement
> >> pertains to those ITS consuming applications.
> >>> Best regards
> >>> Stephan
> >>>
> >>> -----Original Message-----
> >>> From: Norbert Lindenberg [mailto:w3@norbertlindenberg.com]
> >>> Sent: Thursday, 28 February 2013 08:26
> >>> To: Felix Sasaki
> >>> Cc: Norbert Lindenberg; Yves Savourel; public-multilingualweb-lt-
> >> comments@w3.org; 'www-international'
> >>> Subject: Re: I18N-ISSUE-246: Clarify character encoding behavior
> >>> when calculating storage size [ITS-20]
> >>>
> >>>
> >>> On Feb 27, 2013, at 14:55 , Felix Sasaki wrote:
> >>>
> >>>> Hi Yves, Norbert, all,
> >>>>
> >>>> On 27.02.13 13:50, Yves Savourel wrote:
> >>>>> Hi Norbert,
> >>>>>
> >>>>>>> Note also that we have no way to check conformance of the
> >>>>>>> applications using the ITS data for such mandatory support: ITS
> >>>>>>> processors just pass the data along, they don't act on them (in
> >>>>>>> the case of this data category).
> >>>>>> So who does actually act if a string is too long to fit into the
> >>>>>> specified storage?
> >>>>> There is certainly the case of applications that do process ITS
> >>>>> markup and
> >> apply it to the content directly: For example a JavaScript in an
> >> HTML5 page. But there are also applications that use an ITS processor
> >> to feed the content and the ITS information to a distinct system where the
> information is then applied.
> >> They correspond, for example, to the "Localization Workflow Managers"
> >> described in the "potential users of ITS"[1].
> >>>>> So I think it's important to make the distinction between the 'ITS
> processor'
> >> which acts on the markup, and the 'consumer of ITS information' (for
> >> lack of a better name) that applies the ITS information. Both can be
> >> the same application, but they may also be separate ones.
> >>>>> This means a storage-size constraint can be applied completely
> >>>>> outside the original XML/HTML5 document with tools that have no
> >>>>> relation to the ITS processor itself, or to XML/HTML5 for
> >>>>> that matter. Examples of such applications are localization
> >>>>> quality checking tools (like CheckMate, XBench, QA-Distiller,
> >>>>> etc.)
> >>>>>
> >>>>> This is why, from my viewpoint, requiring the 'consumer of ITS
> information'
> >> to support UTF-8 is not important. And I was looking at the case for
> >> consumers that don't have a need for UTF-8, and whether we should
> >> really foist on them such a requirement.
> >>
> >>>>> To answer your question "Do you really want to let systems that
> >>>>> can
> >> represent less than 1% of Unicode advertise themselves as ITS 2.0
> >> conformant?": Why not? If the context where they are utilized is
> >> using only 1% of Unicode, why should they be forced to support more?
> >> I see many customers that never work outside of Latin-1.
> >>>>> This said, supporting UTF-8 is very easy nowadays and promoting
> >>>>> its
> >> support is a good thing too. So in the interest of moving forward and
> >> of promoting better internationalization, I see no problem requiring
> >> the consumer of storage-size to support UTF-8.
> >>>>> The only thing that bothers me a little is that such conformance as
> >>>>> well as
> >> the parts about handling errors, apply to the consumer of the ITS
> >> information, not really the ITS processor, and I'm not sure the scope
> >> of our tests can cover that.
> >>>>>
> >>>>> With regards to the error handling:
> >>>>>
> >>>>>> It could be as simple as "If an ITS processor doesn't support the
> >>>>>> specified character encoding, it must report this as an error and
> >>>>>> terminate processing. If the selected nodes contain characters
> >>>>>> that the specified character encoding cannot represent, the
> >>>>>> processor must report this as an error and terminate processing."
> >>>>>> Or you could try and be nice in the second case and specify a
> >>>>>> fallback strategy, e.g., by saying that the first replacement
> >>>>>> character among U+FFFD, U+003F, U+FF1F that can be represented in
> >>>>>> the specified character encoding must be used instead of any
> >>>>>> character that can't.
> >>>>> I would favor a more practical behavior:
> >>>>>
> >>>>> "If the application applying the information doesn't support the
> >>>>> specified
> >> character encoding, it must report this as an error.
> >>>> One important aspect of above sentence is that - as Yves pointed
> >>>> out - the
> >> "must" would be a lower case "must". That is, this will be no
> >> testable assertion of the ITS 2.0 specification, even if the spec
> >> says "the consumer must support UTF-8". In that sense, we might even
> >> put that requirement into a note, to make clear that from the ITS 2.0
> >> point of view this is rather guidance than a normative statement. Would that
> work for you too, Norbert?
> >>> While it's much better if assertions can be and are tested, keep in
> >>> mind that a
> >> test suite generally can't prove that a system fully conforms to a
> >> spec - it can only show in some cases that it doesn't. And even if
> >> this requirement isn't testable by software, there's still the test
> >> of looking into the developer's eyes and asking "does your system
> >> support UTF-8?". Notes are not requirements, so turning this into a note
> would remove the basis for asking the question.
> >>> Norbert
> >>>
> >>>
> >>>
Received on Wednesday, 20 March 2013 23:06:31 UTC