[ISSUE-106][I18N-ISSUE-246] Storage Size - unit from Yves Savourel on 2013-04-01 (public-multilingualweb-lt-comments@w3.org from April 2013)

From: Yves Savourel <ysavourel@enlaso.com>
Date: Mon, 1 Apr 2013 14:16:46 -0600
To: <public-multilingualweb-lt-comments@w3.org>, "'www-international'" <www-international@w3.org>
Message-ID: <004101ce2f15$d5eed1a0$81cc74e0$@com>

Hi all,

During the I18N IG conference call last week I took an action item to come up with a re-worded text for this data category.

But I'd like to settle the issue of the unit first:

During last week's call Addison convinced me that using only byte as the unit was a bad idea. It seemed logical at the time, but upon further thinking and after looking at changing my implementation, I'm not so sure any more. It looks like byte-only is as good a choice and may be even better. Here are some of the reasons:

If the storage unit depends on the encoding (e.g. byte for UTF-8, 16-bit code unit for UTF-16, etc.), the application that performs the verification needs to do a lot more during the check. It needs to apply different types of checks according to a list of different encodings: for "UTF-16", "UTF-16BE", "UTF-16LE", etc., it counts the 16-bit code units; for "UTF-8" it counts the bytes; for "UTF-32" it counts the 32-bit code units, etc. This means the application needs to know what unit must be used for each encoding, which means hard-coding some if/then logic.
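To illustrate the kind of hard-coding this implies, here is a minimal Python sketch. The function name count_units and the list of recognized encoding names are assumptions made for illustration only; they come from neither the ITS draft nor any existing implementation.

```python
def count_units(text: str, encoding: str) -> int:
    """Count storage units of `text` in an encoding-dependent unit.

    Hypothetical helper: each encoding family needs its own branch.
    """
    name = encoding.lower()
    if name == "utf-8":
        # Unit is the byte.
        return len(text.encode("utf-8"))
    if name in ("utf-16", "utf-16be", "utf-16le"):
        # Unit is the 16-bit code unit; encode without a BOM, then divide.
        return len(text.encode("utf-16-le")) // 2
    if name in ("utf-32", "utf-32be", "utf-32le"):
        # Unit is the 32-bit code unit; encode without a BOM, then divide.
        return len(text.encode("utf-32-le")) // 4
    # Any encoding not listed above has no known unit here.
    raise ValueError(f"no storage unit defined for encoding {encoding!r}")
```

Every encoding the checker should support needs an entry in this kind of branch list, and anything not listed fails.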

With bytes, the verification seems a lot more straightforward: a) instantiate a charset encoder for the specified character encoding, b) use it to get a byte buffer of the string, c) check the byte count. It's easily done in most programming languages.
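Steps a) to c) can be sketched in a few lines of Python. This is only an illustration: the function name fits_storage and its parameters are hypothetical, not part of the ITS draft.

```python
def fits_storage(text: str, max_bytes: int, encoding: str = "utf-8") -> bool:
    """Return True if `text` encoded with `encoding` fits in `max_bytes` bytes."""
    # a) + b) str.encode looks up the codec for `encoding` and returns the bytes.
    data = text.encode(encoding)
    # c) compare the byte count against the limit.
    return len(data) <= max_bytes
```

For example, fits_storage("é", 2, "utf-8") is true because "é" takes two bytes in UTF-8, while fits_storage("é", 1, "utf-8") is false. One caveat with this sketch: Python's generic "utf-16" and "utf-32" codecs prepend a byte-order mark, so a checker would probably want the explicit "utf-16-le" or "utf-16-be" form if the BOM should not count toward the size.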

Sure, this means the application generating the ITS annotation must make sure the specified size is really in bytes, and perform the proper multiplication if the selected encoding is 16-bit or 32-bit, but this is easy to do and mechanical.
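On the generating side, the conversion could look like the following Python sketch. It assumes the original limit is known in code units of the target encoding; the table BYTES_PER_UNIT and the function to_byte_size are hypothetical names used only for this example.

```python
# Width in bytes of one code unit, for a few Unicode encodings (illustrative list).
BYTES_PER_UNIT = {
    "utf-8": 1,
    "utf-16": 2, "utf-16be": 2, "utf-16le": 2,
    "utf-32": 4, "utf-32be": 4, "utf-32le": 4,
}

def to_byte_size(size_in_units: int, encoding: str) -> int:
    """Convert a limit expressed in code units into a limit in bytes."""
    return size_in_units * BYTES_PER_UNIT[encoding.lower()]
```

A field limited to 10 UTF-16 code units would then be annotated with a size of 20 bytes.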

It may also make things clearer that this data category is for storage size only, not for display or other text length constraints. We could even make things really clear by changing the name of the attribute and calling it byteSize, or maxByteSize, or something like this.

So, what are the advantages of using different units for the size?

Thanks,
-yves

Received on Monday, 1 April 2013 20:18:17 UTC