AW: [ISSUE-106][I18N-ISSUE-246] Storage Size - unit from Stephan Walter on 2013-04-02 (www-international@w3.org from April to June 2013)

From: Stephan Walter <stephan.walter@cocomore.com>
Date: Tue, 2 Apr 2013 08:36:01 +0000
To: Yves Savourel <ysavourel@enlaso.com>, "public-multilingualweb-lt-comments@w3.org" <public-multilingualweb-lt-comments@w3.org>, 'www-international' <www-international@w3.org>
Message-ID: <0e78fb4686b447eaad6c3a9be7458f2e@DB3PR05MB041.eurprd05.prod.outlook.com>

Hi all,

I totally agree with Yves. In fact, I think we should also include a sentence pointing out that the data category is not intended to encode display-related properties.

Maybe we should add a sentence like: 'Note that the value of storage size will generally give no indication about the display length of the text and is therefore not adequate for expressing constraints relating to display length.' after: 'The storage size is expressed in bytes and is provided along with the character set encoding used to store the content.'

On the other hand, using code points instead of bytes would mean that we would be (partly) able to express display-related restrictions, which are quite relevant in practice, I suppose.

Best
Stephan

-----Original Message-----
From: Yves Savourel [mailto:ysavourel@enlaso.com]
Sent: Monday, 1 April 2013 22:17
To: public-multilingualweb-lt-comments@w3.org; 'www-international'
Subject: [ISSUE-106][I18N-ISSUE-246] Storage Size - unit

Hi all,

During the I18N IG conference call last week I took an action item to come up with a re-worded text for this data category.

But I'd like to settle on the issue of the unit first:

During last week's call Addison convinced me that using only bytes as the unit was a bad idea. It seemed logical at the time, but after further thought and after looking at changing my implementation, I'm not so sure any more. It looks like bytes only is as good a choice, and maybe even a better one. Here are some of the reasons:

If the storage unit depends on the encoding (e.g. byte for UTF-8, 16-bit code for UTF-16, etc.), the application that performs the verification needs to do a lot more during the check. It needs to apply different types of checks according to a list of different encodings: for "UTF-16", "UTF-16BE", "UTF-16LE", etc., it counts the 16-bit chars; for "UTF-8" it counts the bytes; for "UTF-32" it counts 32-bit codes, etc. This means the application needs to know what unit must be used for each encoding. That means hard-coding some if/then.
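
To illustrate the per-encoding approach described above, here is a minimal Python sketch. The table and the function name count_units are hypothetical, not taken from any implementation; the point is only that every supported encoding needs its own hard-coded entry, and anything missing from the table cannot be checked.

```python
# Hypothetical lookup table (an assumption for illustration):
# encoding name -> (BOM-free Python codec, bytes per storage unit).
UNIT_TABLE = {
    "utf-8": ("utf-8", 1),
    "utf-16": ("utf-16-le", 2),
    "utf-16be": ("utf-16-be", 2),
    "utf-16le": ("utf-16-le", 2),
    "utf-32": ("utf-32-le", 4),
}


def count_units(text: str, encoding: str) -> int:
    """Count the storage units of text, where the unit depends on the encoding."""
    key = encoding.lower()
    if key not in UNIT_TABLE:
        # The checker has no rule for this encoding, so it cannot verify anything.
        raise ValueError(f"no unit rule for encoding {encoding!r}")
    codec, width = UNIT_TABLE[key]
    # Encode without a byte order mark, then divide by the unit width.
    return len(text.encode(codec)) // width
```

With this sketch, a string made of "a" followed by U+1F600 counts as 5 units in UTF-8, 3 in UTF-16 and 2 in UTF-32, while an encoding such as Shift_JIS raises an error until someone adds a rule for it.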

With bytes, the verification seems to be a lot more straightforward: a) instantiate a charset-encoder for the specified character set encoding, b) use it to get a byte buffer of the string, c) check the byte count. It's easily done in most programming languages.
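
Steps a) to c) could look roughly like this in Python. This is a sketch only: the function name fits_storage_size and its parameters are made up for the example, and it assumes the declared encoding name is one that Python's codec registry recognizes.

```python
def fits_storage_size(text: str, max_bytes: int, encoding: str = "UTF-8") -> bool:
    """Return True if text, stored in the given encoding, fits in max_bytes."""
    # a) + b): look up the encoder for the declared encoding and get the bytes.
    data = text.encode(encoding)
    # c): compare the byte count with the declared limit.
    return len(data) <= max_bytes
```

No per-encoding branching is needed; the same three steps work for any encoding the platform supports. One caveat for this sketch: Python's generic "UTF-16" and "UTF-32" codecs prepend a byte order mark, so the byte count includes it unless an explicit variant such as "utf-16-le" is used. Whether a BOM should count toward the storage size is not something this thread settles.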

Sure, this means the application generating the ITS annotation must make sure the specified size is really in bytes and perform the proper multiplication if the selected encoding is 16-bit or 32-bit, but this is something easy to do and mechanical.
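
The multiplication on the generating side might be as small as the following Python sketch. The names size_in_bytes and UNIT_WIDTH_BYTES are hypothetical, and it assumes the original limit is expressed in fixed-width units of 1, 2 or 4 bytes.

```python
# Assumed unit widths in bytes for limits expressed in encoding-specific units.
UNIT_WIDTH_BYTES = {"utf-8": 1, "utf-16": 2, "utf-32": 4}


def size_in_bytes(max_units: int, encoding: str) -> int:
    """Convert a limit given in encoding-specific units into a byte count."""
    return max_units * UNIT_WIDTH_BYTES[encoding.lower()]
```

For example, a field that holds 10 UTF-16 code units would be annotated with a size of 20 bytes.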

It may also make it clearer that this data category is for storage size only, not for display or other text length constraints. We could even make this really clear by changing the name of the attribute and calling it byteSize, or maxByteSize, or something like that.

So, what are the advantages of using different units for the size?

Thanks,
-yves

Received on Tuesday, 2 April 2013 08:37:04 UTC