Re: I18N-ISSUE-246: Clarify character encoding behavior when calculating storage size [ITS-20] from Norbert Lindenberg on 2013-02-27 (www-international@w3.org from January to March 2013)

From: Norbert Lindenberg <w3@norbertlindenberg.com>
Date: Wed, 27 Feb 2013 00:26:47 -0800
To: Yves Savourel <ysavourel@enlaso.com>
Cc: Norbert Lindenberg <w3@norbertlindenberg.com>, <public-multilingualweb-lt-comments@w3.org>, "'www-international'" <www-international@w3.org>
Message-Id: <01CF25DE-6883-4D73-906B-1C63204155C8@norbertlindenberg.com>

On Feb 21, 2013, at 8:58 , Yves Savourel wrote:

> Hi Norbert,
> 
> Related to:
> http://lists.w3.org/Archives/Public/public-multilingualweb-lt-comments/2013Feb/0028.html
> 
>> I don't see in your messages any justification why the standard should 
>> not require at least support for UTF-8, and why it should not specify 
>> error handling for commonly occurring situations. Can you please explain?
>> If an application can't rely on any encoding being supported, can't 
>> find out whether a requested encoding is supported, and can't rely 
>> on unsupported characters being handled in some reasonable way, 
>> then using this data category seems quite risky.
> 
> The original comments were:
> 
>> Several aspects of the interpretation of the character encoding 
>> given as storageEncoding need to be clarified:
>> - Which character encodings is an implementation required to support? 
>> Support for UTF-8 must be mandatory.
>> - What's the required behavior if storageEncoding specifies a character
>> encoding that the implementation doesn't support?
>> - What's the required behavior if the selected nodes contain characters 
>> that the specified character encoding cannot represent?
> 
> I'm not sure why support for UTF-8 would be mandatory (I'm not against, but just asking why).
> The encoding is the one used in the store where the data resides and can be anything (resource file, database, etc.), not necessarily some XML-based system. What would be the rationale to force support for an arbitrary encoding?

UTF-8 isn't arbitrary; it's the Unicode encoding most commonly used in files. If databases are involved, we might relax that a bit and require that any implementation support at least one of UTF-8, UTF-16, or UTF-32. I think allowing implementations to support no Unicode encoding at all and risk data loss is no longer acceptable. If this were an IETF standard, I'd point to RFC 2277; the W3C Character Model isn't quite as strongly worded.
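
To make the open questions concrete, here is a rough Python sketch of what a checking tool might do with a storage size and a storage encoding. The function name, its parameters, and the status strings are all hypothetical; nothing here is taken from the ITS 2.0 draft, and the choice to report problems rather than substitute characters is just one possible policy.

```python
import codecs

def check_storage_size(text, storage_size, storage_encoding):
    """Hypothetical check of `text` against a byte limit in a given encoding.

    Returns a (status, byte_count) pair. The status values are illustrative
    only; they are not defined by any specification.
    """
    # Question 1: is the requested encoding supported at all?
    try:
        codecs.lookup(storage_encoding)
    except LookupError:
        return ("unsupported-encoding", None)

    # Question 2: can the encoding represent every character in the text?
    # errors="strict" makes unrepresentable characters raise instead of
    # being silently replaced, so data loss is reported, not hidden.
    try:
        encoded = text.encode(storage_encoding, errors="strict")
    except UnicodeEncodeError:
        return ("unrepresentable-characters", None)

    # Question 3: does the encoded form fit into the storage?
    size = len(encoded)
    return ("too-long" if size > storage_size else "ok", size)


print(check_storage_size("café", 5, "UTF-8"))             # fits in UTF-8
print(check_storage_size("café", 5, "iso-8859-1"))        # fits in Latin-1
print(check_storage_size("日本語", 10, "iso-8859-1"))      # cannot be encoded
print(check_storage_size("abc", 10, "no-such-encoding"))  # unknown encoding
```

Without required behavior for the first two cases, two tools could give different answers for the same data, which is the risk described above.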

> One can imagine a user having the data stored in Latin-1, the data extracted to some XML export format (in UTF-8) where the storage size encoding would be set to iso-8859-1 and his checking tool supporting only that encoding. Why would such a user have to implement support for UTF-8 if he doesn't use it?

Do you really want to let systems that can represent less than 1% of Unicode advertise themselves as ITS 2.0 conformant?
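
For a rough sense of scale, the following Python sketch counts how many Unicode scalar values a given encoding can represent. The helper name is made up for illustration, and the denominator is all scalar values rather than assigned characters, so the resulting share is smaller than a count against assigned characters would be.

```python
def count_encodable(encoding):
    """Count the Unicode scalar values that `encoding` can represent."""
    count = 0
    for code_point in range(0x110000):
        # Surrogates are not scalar values and cannot be encoded on their own.
        if 0xD800 <= code_point <= 0xDFFF:
            continue
        try:
            chr(code_point).encode(encoding)
        except UnicodeEncodeError:
            continue
        count += 1
    return count


# All code points minus the 2,048 surrogates.
TOTAL_SCALAR_VALUES = 0x110000 - 0x800

latin1_count = count_encodable("iso-8859-1")
latin1_share = latin1_count / TOTAL_SCALAR_VALUES
print(latin1_count, f"{latin1_share:.4%}")
```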

> Note also that we have no way to check conformance of the applications using the ITS data for such mandatory support: ITS processors just pass the data along, they don't act on them (in the case of this data category).

So who actually acts if a string is too long to fit into the specified storage?

Norbert

Received on Wednesday, 27 February 2013 08:27:16 UTC