RE: Call for consensus - storageSize and displaySize

Hi Yves, all,

OK. Now I got it! ;-) Thanks Yves for taking the time to explain this to me. 

Cheers

Michael

________________________________________
Dr. Michael Kruppa, Senior IT-Consultant 
Tel.: +49 69 972 69 189 Fax: +49 69 972 69 204; E-Mail: michael.kruppa@cocomore.com 
Cocomore AG, Gutleutstraße 30, D-60329 Frankfurt
Internet: http://www.cocomore.de Facebook: http://www.facebook.com/cocomore Google+: http://plus.cocomore.de

Cocomore is an active member of the World Wide Web Consortium (W3C) and of the Bundesverband Digitale Wirtschaft (BVDW)
Executive Board: Dr. Hans-Ulrich von Freyberg (Chair), Dr. Jens Fricke, Marc Kutschera; Chair of the Supervisory Board: Martin Velasco; Registered office: Frankfurt/Main; Amtsgericht Frankfurt am Main, HRB 51114

dmexco 2012 in Cologne: Come see us on September 12 and 13 at the Digital Marketing Exposition and Conference (hall 7, stand E057).


-----Original Message-----
From: Yves Savourel [mailto:ysavourel@enlaso.com]
Sent: Saturday, 25 August 2012 21:46
To: public-multilingualweb-lt@w3.org
Subject: RE: Call for consensus - storageSize and displaySize

Hi Michael,

> I'm still confused about storage size. In my
> understanding: If I state a storage size limit in bytes then I'm done.
> I would interpret this limit as: Whatever content you put here, it 
> shall not exceed the maximum number of bytes.
> Whether I use encoding A or B should be irrelevant, since I have
> to ensure that my text using my encoding does not exceed the byte 
> limit.
> I think, one would only need the additional encoding attribute if we 
> would base storage on character counts.
> Or is this a totally wrong understanding?

It seems you got it backward: you wouldn't need the encoding if the unit of storage were the character (presumably the Unicode code point), but you do need it when the unit is the byte. And for storage, the byte is the only unit one can use.

Let's say you have a storage field that cannot take more than 11 bytes.
Let's say your original English text is: "It's summer" (11 Unicode code points)

Let's say your file/db/whatever is using UTF-8 to store the field.
"It's summer" gives you:
49,74,27,73,20,73,75,6d,6d,65,72 = 11 bytes.
 
Now we are translating into French. The text is: "C'est l'été" (11 Unicode code points)

In UTF-8 that is encoded as:
43,27,65,73,74,20,6c,27,c3,a9,74,c3,a9 = 13 bytes.
It's too long to fit into your field!

If the encoding used to store the field was ISO-8859-1 we would have:
43,27,65,73,74,20,6c,27,e9,74,e9 = 11 bytes

The difference is the two 'é' characters: in ISO-8859-1 each is encoded in one byte (0xE9), but in UTF-8 each is encoded in two bytes (0xC3,0xA9). That's why we have two 'extra' bytes in UTF-8.

That is why a tool that checks whether a given text fits the storage must know which encoding is used; otherwise it simply cannot calculate the byte count.
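
As a rough illustration of such a check, here is a minimal Python sketch. The function name fits_storage and the 11-byte limit are assumptions made up for this example; they are not part of any proposed ITS attribute or existing tool.

```python
def fits_storage(text: str, max_bytes: int, encoding: str) -> bool:
    """Return True if text, once encoded, takes at most max_bytes bytes.

    Hypothetical helper for illustration only.
    """
    return len(text.encode(encoding)) <= max_bytes


english = "It's summer"
# "C'est l'été", written with escapes so the precomposed é (U+00E9) is used
french = "C'est l'\u00e9t\u00e9"

# Both strings have the same number of code points...
print(len(english), len(french))

# ...but the byte count of the French text depends on the encoding.
for enc in ("utf-8", "iso-8859-1"):
    encoded = french.encode(enc)
    print(enc, len(encoded), encoded.hex(","), fits_storage(french, 11, enc))
```

With the 11-byte limit, the French text passes the check for ISO-8859-1 and fails it for UTF-8, even though the string itself is unchanged.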
 
These byte/char/encoding-related matters are often confusing; I hope this helps.

Cheers,
-yves

Received on Monday, 27 August 2012 15:29:56 UTC