Re: I18N-ISSUE-246: Clarify character encoding behavior when calculating storage size [ITS-20] from Yves Savourel on 2013-02-21 (www-international@w3.org from January to March 2013)

From: Yves Savourel <ysavourel@enlaso.com>
Date: Thu, 21 Feb 2013 09:58:49 -0700
To: <public-multilingualweb-lt-comments@w3.org>
CC: "'www-international'" <www-international@w3.org>
Message-ID: <002501ce1054$b8f9a9b0$2aecfd10$@com>

Hi Norbert,

Related to:
http://lists.w3.org/Archives/Public/public-multilingualweb-lt-comments/2013Feb/0028.html

> I don't see in your messages any justification why the standard should 
> not require at least support for UTF-8, and why it should not specify 
> error handling for commonly occurring situations. Can you please explain?
> If an application can't rely on any encoding being supported, can't 
> find out whether a requested encoding is supported, and can't rely 
> on unsupported characters being handled in some reasonable way, 
> then using this data category seems quite risky.

The original comments were:

> Several aspects of the interpretation of the character encoding 
> given as storageEncoding need to be clarified:
> - Which character encodings is an implementation required to support? 
> Support for UTF-8 must be mandatory.
> - What's the required behavior if storageEncoding specifies a character
> encoding that the implementation doesn't support?
> - What's the required behavior if the selected nodes contain characters 
> that the specified character encoding cannot represent?

I'm not sure why support for UTF-8 would be mandatory (I'm not against it, just asking why).
The encoding is the one used in the store where the data resides, and that store can be anything (resource file, database, etc.), not necessarily some XML-based system. What would be the rationale for forcing support for an arbitrary encoding?

One can imagine a user whose data is stored in Latin-1 and extracted to some XML export format (in UTF-8) where the storage size encoding is set to iso-8859-1, and whose checking tool supports only that encoding. Why would such a user have to implement support for UTF-8 if he doesn't use it?
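
To make the Latin-1 case concrete, here is a small sketch (the sample string is made up for illustration) showing that the byte count that matters is the one in the storage encoding, not the one in the encoding of the XML export:

```python
# Hypothetical sample string: "é" takes one byte in Latin-1 and two in UTF-8.
text = "café"

# Size in the encoding of the original store (what the storage size limit refers to).
latin1_size = len(text.encode("iso-8859-1"))

# Size in the encoding of the XML export (irrelevant for the limit in this scenario).
utf8_size = len(text.encode("utf-8"))

print(latin1_size)  # 4
print(utf8_size)    # 5
```

A checking tool for that user only ever needs the first computation.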

Note also that we have no way to check the conformance of applications using the ITS data with such a mandatory-support requirement: ITS processors just pass the data along, they don't act on it (in the case of this data category).

The next two points are useful. The specification could be clearer and define some behavior for those two cases. I would say: in both cases the application should generate an error and continue.
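
As a rough sketch of what "generate an error and continue" could look like in a checking tool (this is only an illustration, not text proposed for the specification; the function name, parameter names, and messages are all hypothetical):

```python
def check_storage_size(text, storage_size, storage_encoding):
    """Return a list of problem messages; an empty list means the text fits.

    Hypothetical helper: errors are reported as messages rather than raised,
    so the caller can carry on with the remaining content.
    """
    try:
        encoded = text.encode(storage_encoding)
    except LookupError:
        # Case 1: the tool does not support the specified encoding.
        return [f"unsupported encoding: {storage_encoding}"]
    except UnicodeEncodeError as exc:
        # Case 2: the content has a character the encoding cannot represent.
        bad = exc.object[exc.start]
        return [f"character {bad!r} cannot be represented in {storage_encoding}"]

    if len(encoded) > storage_size:
        return [f"content is {len(encoded)} bytes, limit is {storage_size}"]
    return []


# Usage: report each problem and keep going with the next item.
items = [
    ("café", 4, "iso-8859-1"),    # fits
    ("café", 4, "utf-8"),         # too long once encoded in UTF-8
    ("€", 10, "iso-8859-1"),      # not representable in Latin-1
    ("abc", 10, "no-such-encoding"),
]
for text, size, encoding in items:
    for problem in check_storage_size(text, size, encoding):
        print(f"error: {problem}")
```

Which encodings are actually available depends on the tool's platform; here it is whatever Python's codec registry provides.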

Does this help?
-yves

Received on Thursday, 21 February 2013 16:59:23 UTC