- From: Asmus Freytag <asmusf@ix.netcom.com>
- Date: Tue, 11 Dec 2007 01:31:38 -0800
- To: Martin Duerst <duerst@it.aoyama.ac.jp>, Richard Ishida <ishida@w3.org>
- CC: www-international@w3.org
On 12/11/2007 1:00 AM, Martin Duerst wrote:
>
>>> | Most Web pages use the UTF-8 encoding for Unicode text.
>>>
>> [...]
>>
>>> Are you sure about "most Web pages" (as of today) ?
>>>
>> This evoked a double take from me, too. I had to re-read to see that
>> "for Unicode text" was making a much smaller claim than I first thought.
>> In the sense in which it is meant, however (UTF-8 is more common than
>> UTF-[7,16,32] variants), it seems very likely true.
>>
>
> Somewhat similar for me, too. I'm sure that we can tweak the wording
> so that it's easier to read.
>

I was going to suggest:

"UTF-8 is the most widely used way to represent Unicode text in web pages."

but then I looked at the original text. OK, now try the same thing in context:

The existing paragraph:

"Other character sets use a more complicated approach. With the Unicode character set, which covers most characters you are likely to need to use in a single set, that same Cyrillic character щ has a codepoint value of 1097. This is too high a number to be represented by a single byte. Most Web pages use the UTF-8 encoding for Unicode text. In that encoding щ <images/1097.png> will be represented by two bytes, but the codepoint value is not simply derived from the value of the two bytes - some more complicated decoding is needed. Other Unicode characters map to one, three or four bytes in the UTF-8 encoding."

The paragraph annotated:

"Other character sets use a more complicated approach.

After you've just described how confusion reigns with context-dependent single bytes, I wouldn't use "complicated" here.

=> Other character sets use a more unified approach.

"With the Unicode character set, which covers most characters you are likely to need to use in a single set, that same Cyrillic character щ has a codepoint value of 1097.

Make the point that the same character set actually contains both.

=> With the Unicode character set, you can represent both characters - while the value of 233 still represents the é, the Cyrillic character щ now has a different codepoint value of 1097.

"This is too high a number to be represented by a single byte.

add,

=+ It can take up to four bytes per character to cover all characters in Unicode, because Unicode covers most characters you are likely to ever need in a single set. There are several encodings that can represent Unicode text.

"Most Web pages use the UTF-8 encoding for Unicode text.

With the addition you can actually leave the sentence as is, or you can tweak it:

"The most widely used way to represent Unicode text in web pages is called UTF-8."

The remainder of the paragraph is fine.

"In that encoding щ <images/1097.png> will be represented by two bytes, but the codepoint value is not simply derived from the value of the two bytes - some more complicated decoding is needed. Other Unicode characters map to one, three or four bytes in the UTF-8 encoding."

The paragraph consolidated (and minor tweaks added):

"Other character sets use a more unified approach. For example, with the Unicode character set, you can represent both characters in the same set. While the value of 233 still represents the é, the Cyrillic character щ now has a different codepoint value of 1097. This is too high a number to be represented by a single byte; it can take up to four bytes per character to cover all characters in Unicode, because Unicode covers most characters you are likely to ever need in a single set. There are several encodings that can represent Unicode text. Most Web pages use the UTF-8 encoding for Unicode text. In that encoding щ <images/1097.png> will be represented by two bytes, but the codepoint value is not simply derived from the value of the two bytes - some more complicated decoding is needed.
Other Unicode characters map to one, three or four bytes in the UTF-8 encoding."

A./
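
For anyone who wants to check the numbers in the paragraph, here is a minimal Python sketch. It assumes the single-byte character sets under discussion are ISO-8859-1 and ISO-8859-5 (the thread does not name them), and the helper `decode_two_byte_utf8` is a hypothetical name made up for illustration. It only handles the two-byte form and is not a full UTF-8 decoder.

```python
def decode_two_byte_utf8(data: bytes) -> int:
    """Recover a code point from a two-byte UTF-8 sequence (110xxxxx 10yyyyyy).

    Illustrative only: it does not reject overlong forms and does not
    handle one-, three- or four-byte sequences.
    """
    if len(data) != 2:
        raise ValueError("expected exactly two bytes")
    lead, trail = data
    if (lead & 0xE0) != 0xC0 or (trail & 0xC0) != 0x80:
        raise ValueError("not a two-byte UTF-8 sequence")
    # Five payload bits from the lead byte, six from the trail byte.
    return ((lead & 0x1F) << 6) | (trail & 0x3F)


# Assumption: the two single-byte sets are ISO-8859-1 and ISO-8859-5.
# The same byte value, 233, is a different character in each.
latin = bytes([233]).decode("iso-8859-1")      # é
cyrillic = bytes([233]).decode("iso-8859-5")   # щ

# In Unicode both characters live in one set, with distinct code points.
print(ord(latin), ord(cyrillic))               # 233 1097

# UTF-8 needs two bytes for щ, and neither byte equals the code point.
shcha_utf8 = cyrillic.encode("utf-8")
print(shcha_utf8.hex(" "))                     # d1 89
print(decode_two_byte_utf8(shcha_utf8))        # 1097

# Other characters take one, three or four bytes in UTF-8.
samples = ("A", "\u00e9", "\u0449", "\u20ac", "\U0001F600")
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
print(list(utf8_lengths.values()))             # [1, 2, 2, 3, 4]
```

The bit masking in the helper is the "more complicated decoding" the paragraph refers to: 0xD1 0x89 carries the value 1097 only after the marker bits of each byte are stripped and the remaining bits are joined.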
Received on Tuesday, 11 December 2007 09:32:01 UTC