RE: Call for consensus - storageSize and displaySize

Hi Michael,

> I'm still confused about storage size. In my 
> understanding: If I state a storage size limit 
> in bytes then I'm done. 
> I would interpret this limit as: Whatever content 
> you put here, it shall not exceed the maximum number of bytes.
> Whether I use encoding A or B should be irrelevant, 
> since I have to ensure that my text using my 
> encoding does not exceed the byte limit. 
> I think, one would only need the additional encoding 
> attribute if we would base storage on character counts.
> Or is this a totally wrong understanding?

It seems you got it backward: you wouldn't need the encoding if the unit of storage were the character (presumably the Unicode code point), but you do need it when the unit is the byte. And for storage, the byte is the only unit one can use.

Let's say you have a storage field that cannot take more than 11 bytes.
Let's say your original English text is: "It's summer" (11 Unicode code points)

Let's say your file/db/whatever is using UTF-8 to store the field.
"It's summer" gives you:
49,74,27,73,20,73,75,6d,6d,65,72 = 11 bytes.
 
Now we are translating into French. The text is: "C'est l'été" (11 Unicode code points)

In UTF-8 that is encoded as:
43,27,65,73,74,20,6c,27,c3,a9,74,c3,a9 = 13 bytes.
It's too long to fit into your field!

If the encoding used to store the field was ISO-8859-1 we would have:
43,27,65,73,74,20,6c,27,e9,74,e9 = 11 bytes

The difference is the two 'é': in ISO-8859-1 each one is encoded in a single byte (0xE9), but in UTF-8 each one takes two bytes (0xC3,0xA9). That's why we have two 'extra' bytes in UTF-8.
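
For anyone who wants to check these numbers, here is a minimal sketch in Python. Python is only an illustrative choice, and the helper name byte_dump is made up for this example; nothing here is tied to a particular tool.

```python
def byte_dump(text, encoding):
    """Return the encoded bytes of text as comma-separated hex pairs."""
    return ",".join(f"{b:02x}" for b in text.encode(encoding))


english = "It's summer"
french = "C'est l'\u00e9t\u00e9"  # "C'est l'été", written with escapes to avoid source-encoding issues

# Same strings as in the example above, dumped in both encodings.
for text, encoding in [
    (english, "utf-8"),
    (french, "utf-8"),
    (french, "iso-8859-1"),
]:
    size = len(text.encode(encoding))
    print(f"{len(text)} code points in {encoding}: {byte_dump(text, encoding)} = {size} bytes")
```

The number of code points stays at 11 for both strings; only the byte count changes with the encoding.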

That is why a tool that checks whether a given text fits the storage must know which encoding is used; otherwise it simply cannot calculate the size in bytes.
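
As a rough illustration of such a check, assuming a hypothetical function fits_storage that takes the text, the byte limit, and the encoding name (none of these names come from any specification), it could look like this in Python:

```python
def fits_storage(text, max_bytes, encoding):
    """Return True if text, once encoded with encoding, needs at most max_bytes bytes."""
    return len(text.encode(encoding)) <= max_bytes


french = "C'est l'\u00e9t\u00e9"  # "C'est l'été"

# The same 11-code-point string against the same 11-byte field:
print(fits_storage(french, 11, "utf-8"))       # False: needs 13 bytes
print(fits_storage(french, 11, "iso-8859-1"))  # True: needs 11 bytes
```

Without the encoding argument the function has no way to answer, which is the whole point. Note that this sketch does not handle text containing characters the chosen encoding cannot represent; in Python, encode() would raise UnicodeEncodeError in that case.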
 
These byte/char/encoding-related matters are often confusing. I hope this helps.

Cheers,
-yves

Received on Saturday, 25 August 2012 19:46:23 UTC