Re: For review: Character encodings for beginners from Asmus Freytag on 2007-12-11 (www-international@w3.org from October to December 2007)

From: Asmus Freytag <asmusf@ix.netcom.com>
Date: Tue, 11 Dec 2007 01:31:38 -0800
To: Martin Duerst <duerst@it.aoyama.ac.jp>, Richard Ishida <ishida@w3.org>
CC: www-international@w3.org
Message-ID: <475E58FA.7080909@ix.netcom.com>
On 12/11/2007 1:00 AM, Martin Duerst wrote:
>   
>>> | Most Web pages use the UTF-8 encoding for Unicode text.
>>>       
>> [...] 
>>     
>>> Are you sure about "most Web pages" (as of today) ?
>>>       
>> This evoked a double take from me, too.  I had to re-read to see that
>> "for Unicode text" was making a much smaller claim than I first thought.
>> In the sense in which it is meant, however (UTF-8 is more common than
>> UTF-[7,16,32] variants), it seems very likely true.
>>     
>
> Somewhat similar for me, too. I'm sure that we can tweak the wording
> so that it's easier to read.
>   
I was going to suggest:

"UTF-8 is the most widely used way to represent Unicode text in web pages."

but then I looked at the original text.

OK, now try the same thing in context:

The existing paragraph:

"Other character sets use a more complicated approach. With the Unicode 
character set, which covers most characters you are likely to need to 
use in a single set, that same Cyrillic character щ has a codepoint 
value of 1097. This is too high a number to be represented by a single 
byte. Most Web pages use the UTF-8 encoding for Unicode text. In that 
encoding щ <images/1097.png> will be represented by two bytes, but the 
codepoint value is not simply derived from the value of the two bytes - 
some more complicated decoding is needed. Other Unicode characters map 
to one, three or four bytes in the UTF-8 encoding."

The paragraph annotated:

"Other character sets use a more complicated approach.

After you've just described how confusion reigns with context dependent 
single bytes, I wouldn't use "complicated" here.

=> Other character sets use a more unified approach.

"With the Unicode character set, which covers most characters you are 
likely to need to use in a single set, that same Cyrillic character щ 
has a codepoint value of 1097.

Make the point that the same character set actually contains both.

=> With the Unicode character set, you can represent both characters: 
while the value of 233 still represents the é, the Cyrillic character щ 
now has a different codepoint value of 1097.

"This is too high a number to be represented by a single byte.

add,

=+ It can take up to four bytes per character to cover all characters in 
Unicode, because Unicode covers most characters you are likely to ever 
need in a single set. There are several encodings that can represent 
Unicode text.
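
To make the "up to four bytes" and "several encodings" points concrete, here is a rough Python sketch. It is not part of the article under review. The characters é and щ come from the article; the others, and the helper name encoded_lengths, are only my illustrative picks.

```python
# é and щ are the article's examples; "A", "€" and "𝄞" are illustrative additions.
samples = ["A", "é", "щ", "€", "𝄞"]

def encoded_lengths(ch):
    """Return how many bytes one character occupies in three Unicode encodings."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),  # big-endian, no byte order mark
        "utf-32": len(ch.encode("utf-32-be")),  # big-endian, no byte order mark
    }

for ch in samples:
    # Print the code point in decimal and hex, then the byte counts per encoding.
    print(ord(ch), f"U+{ord(ch):04X}", encoded_lengths(ch))
```

In UTF-8 these five characters take 1, 2, 2, 3 and 4 bytes respectively, while UTF-32 always takes 4.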

"Most Web pages use the UTF-8 encoding for Unicode text.

With the addition, you can actually leave the sentence as is, or you can 
tweak it:

"The most widely used way to represent Unicode text in web pages is 
called UTF-8."

The remainder of the paragraph is fine.

"In that encoding щ <images/1097.png> will be represented by two bytes, 
but the codepoint value is not simply derived from the value of the two 
bytes - some more complicated decoding is needed. Other Unicode 
characters map to one, three or four bytes in the UTF-8 encoding."

The paragraph consolidated (and minor tweaks added):

"Other character sets use a more unified approach. For example, with the 
Unicode character set, you can represent both characters in the same 
set. While the value of 233 still represents the é, the Cyrillic 
character щ now has a different codepoint value of 1097. This is too 
high a number to be represented by a single byte; it can take up to four 
bytes per character to cover all characters in Unicode, because Unicode 
covers most characters you are likely to ever need in a single set. 
There are several encodings that can represent Unicode text. Most Web 
pages use the UTF-8 encoding 
for Unicode text. In that encoding щ <images/1097.png> will be 
represented by two bytes, but the codepoint value is not simply derived 
from the value of the two bytes - some more complicated decoding is 
needed. Other Unicode characters map to one, three or four bytes in the 
UTF-8 encoding."
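
For anyone who wants to see what "some more complicated decoding" amounts to for щ, here is a minimal Python sketch of the two-byte case only. The function name decode_two_byte_utf8 is just my label, and the sketch deliberately skips the checks a real decoder needs (for example, rejecting overlong forms).

```python
def decode_two_byte_utf8(data):
    """Recover the code point from a two-byte UTF-8 sequence (110xxxxx 10yyyyyy).

    Simplified sketch: it checks the marker bits but does not reject
    overlong encodings, which a conforming decoder must do.
    """
    first, second = data  # raises ValueError unless there are exactly two bytes
    if first & 0b11100000 != 0b11000000 or second & 0b11000000 != 0b10000000:
        raise ValueError("not a two-byte UTF-8 sequence")
    # Keep the 5 payload bits of the lead byte and the 6 of the trail byte.
    return ((first & 0b00011111) << 6) | (second & 0b00111111)

encoded = "щ".encode("utf-8")
print(encoded.hex())                  # d189
print(decode_two_byte_utf8(encoded))  # 1097
```

The two bytes are 0xD1 and 0x89 (209 and 137 in decimal), so the value 1097 only appears after the marker bits are stripped and the remaining bits are joined.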

A./
Received on Tuesday, 11 December 2007 09:32:01 UTC