
RE: [ISSUE-106][I18N-ISSUE-246] Storage Size - unit

From: Pablo Nieto Caride <pablo.nieto@linguaserve.com>
Date: Tue, 2 Apr 2013 09:47:01 +0200
To: "'Yves Savourel'" <ysavourel@enlaso.com>, <public-multilingualweb-lt-comments@w3.org>, "'www-international'" <www-international@w3.org>
Message-ID: <038a01ce2f76$43333980$c999ac80$@linguaserve.com>
Hi Yves,

+1 to what you say: it's simpler and avoids some hard-coding. The only advantage of using different units that occurs to me is that it's easier for a human annotator to specify 1 MB than 1048576 bytes.

Cheers,
Pablo.
_____________________________________________________

Hi all,

During the I18N IG conference call last week I took an action item to come up with a re-worded text for this data category.

But I'd like to settle on the issue of the unit first:

During last week's call Addison convinced me that using only the byte as the unit was a bad idea. It seemed logical at the time, but upon further thinking, and after looking at changing my implementation, I'm not so sure any more. It looks like byte-only is as good a choice and may even be better. Here are some of the reasons:

If the storage unit depends on the encoding (e.g. bytes for UTF-8, 16-bit code units for UTF-16, etc.), the application that performs the verification needs to do a lot more during the check. It needs to apply a different type of check according to a list of encodings: for "UTF-16", "UTF-16BE", "UTF-16LE", etc., it counts 16-bit code units; for "UTF-8" it counts bytes; for "UTF-32" it counts 32-bit code units; and so on. This means the application needs to know which unit applies to each encoding, which in turn means hard-coding some if/then logic.
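To make the objection concrete, here is a minimal sketch (in Python, with hypothetical function and encoding names; nothing here comes from the ITS spec) of the per-encoding branching such a checker would need:

```python
def count_storage_units(text: str, encoding: str) -> int:
    """Hypothetical encoding-dependent check: the unit counted
    varies with the encoding family, so each family needs its
    own hard-coded branch."""
    enc = encoding.upper()
    if enc in ("UTF-16", "UTF-16BE", "UTF-16LE"):
        # Count 16-bit code units (a supplementary character,
        # encoded as a surrogate pair, counts as two units).
        return len(text.encode("utf-16-be")) // 2
    elif enc in ("UTF-32", "UTF-32BE", "UTF-32LE"):
        # Count 32-bit code units (one per code point).
        return len(text.encode("utf-32-be")) // 4
    else:
        # UTF-8 and legacy single-/multi-byte encodings: count bytes.
        return len(text.encode(encoding))
```

Every new encoding family added to the list means another branch, which is exactly the hard-coding the paragraph above describes.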

Using bytes, the verification is a lot more straightforward: a) instantiate a charset encoder for the specified character encoding, b) use it to get a byte buffer of the string, c) check the byte count. This is easily done in most programming languages.
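The three steps above can be sketched in a few lines (a hypothetical Python illustration; the function name and parameters are mine, not from any spec):

```python
def within_byte_limit(text: str, encoding: str, max_bytes: int) -> bool:
    # a) + b) use the charset encoder for the specified encoding
    #         to get the encoded byte buffer of the string
    encoded = text.encode(encoding)
    # c) check the byte count against the declared storage size
    return len(encoded) <= max_bytes
```

Note that the same single code path handles UTF-8, UTF-16, UTF-32, and legacy encodings alike; no per-encoding branching is needed.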

Sure, this means the application generating the ITS annotation must make sure the specified size really is in bytes, and perform the proper multiplication if the selected encoding uses 16-bit or 32-bit code units, but this is something easy to do and mechanical.
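That "easy and mechanical" multiplication amounts to something like the following (a hypothetical sketch; the lookup table and names are illustrative and assume fixed-width code units for UTF-16/UTF-32):

```python
# Bytes per code unit for the encodings discussed (illustrative table).
BYTES_PER_CODE_UNIT = {"UTF-8": 1, "UTF-16": 2, "UTF-32": 4}

def limit_in_bytes(limit_in_code_units: int, encoding: str) -> int:
    """Convert a limit expressed in code units of `encoding`
    into the byte value to put in the annotation."""
    return limit_in_code_units * BYTES_PER_CODE_UNIT[encoding]
```

So an annotator's "80 UTF-16 code units" becomes 160 bytes once, at annotation time, rather than making every verifier carry encoding-specific logic.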

It may also make it clearer that this data category is for storage size only, not for display or other text-length constraints. We could even make things really clear by changing the name of the attribute to byteSize, or maxByteSize, or something like this.

So, what are the advantages of using different units for the size?

Thanks,
-yves
Received on Tuesday, 2 April 2013 07:47:29 UTC
