RE: Questions/comments about glossary and FAQ entries (i18n-actions#25) from Addison Phillips on 2023-07-24 (public-i18n-core@w3.org from July to September 2023)

From: Addison Phillips <addisoni18n@gmail.com>
Date: Mon, 24 Jul 2023 15:41:28 -0700
To: "'Ken Whistler'" <kenwhistler@sonic.net>, "'Markus Scherer'" <markus.icu@gmail.com>
Cc: "'Editorial Committee'" <edcom@unicode.org>, "'Internationalization Working Group'" <public-i18n-core@w3.org>
Message-ID: <043201d9be7f$fc489700$f4d9c500$@gmail.com>

Thanks Ken. I guess we missed the glossary entry for UTF-16 (I didn’t look before composing the email).

 

I will note that the UTF-16 definition says that it is “[a] multibyte encoding…”. I usually think of a multibyte encoding as a (possibly variable-width) encoding that uses bytes as the code unit. It’s not technically wrong to call UTF-16 multibyte, since multiple bytes are used, but maybe a few of us recall the “mb-vs-w” distinction in some APIs? The Unicode glossary doesn’t define “multibyte encoding”, nor do Charmod, Encoding, or the W3C I18N glossary, so maybe never mind.
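To make the code-unit point concrete, here is a minimal Python sketch (not taken from any of the specifications mentioned) that counts code units for the same text in UTF-8, where the code unit is a byte, and in UTF-16, where the code unit is 16 bits wide. The helper name `code_units` and the sample string are assumptions chosen for illustration.

```python
def code_units(text: str, encoding: str, unit_bytes: int) -> int:
    """Count code units of `text` in a BOM-less encoding whose
    code units are `unit_bytes` bytes wide (hypothetical helper)."""
    encoded = text.encode(encoding)
    return len(encoded) // unit_bytes


# 'a', 'é' (U+00E9), and U+1F600 (a character outside the BMP)
sample = "a\u00e9\U0001F600"

# UTF-8: the code unit is one byte, so units == bytes (1 + 2 + 4).
utf8_units = code_units(sample, "utf-8", 1)

# UTF-16: the code unit is two bytes (1 + 1 + 2 code units).
utf16_units = code_units(sample, "utf-16-le", 2)
utf16_bytes = len(sample.encode("utf-16-le"))

print(utf8_units, utf16_units, utf16_bytes)
```

Both encodings are variable-width; the difference is only whether the unit being counted is a byte or a 16-bit value.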

 

---

 

Is there a better way for me to file bug reports/requests/etc?

 

Addison

 

From: Ken Whistler <kenwhistler@sonic.net> 
Sent: Monday, July 24, 2023 3:10 PM
To: Markus Scherer <markus.icu@gmail.com>; Addison Phillips <addisoni18n@gmail.com>
Cc: Editorial Committee <edcom@unicode.org>; Internationalization Working Group <public-i18n-core@w3.org>
Subject: Re: Questions/comments about glossary and FAQ entries (i18n-actions#25)

 

Markus, Addison,

This particular FAQ item is very old and problematic. A Wayback Machine snapshot from June 3, 2004:

"A: UTF-16 uses a single 16-bitcode unit to encode the most common 63K characters, and a pair of 16-bit code unites, called surrogates, to encode the 1M less commonly used characters in Unicode."

Yes, "unites" [sic]. ;-) So nobody has really attended much to it for 2 decades now.

As Markus surmised, it should have been 62K characters, not 63K, after subtracting the 2K surrogate code points. But in any case, I conclude on this one:

A. W3C should not be referring to this particular item for anything. The glossary entry is much more correct and aligned with the core specification text:

https://www.unicode.org/glossary/#UTF_16

B. This particular feedback should be tossed to the FAQ group for correction and rewording.
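For reference, the arithmetic behind the 62K-versus-63K point can be sketched in a few lines of Python. The ranges are the standard ones (BMP U+0000..U+FFFF, surrogates U+D800..U+DFFF, supplementary planes U+10000..U+10FFFF); the constant names are only illustrative. These are counts of code points, not of assigned characters.

```python
# Code point ranges (inclusive), counted as code points, not assigned characters.
BMP_CODE_POINTS = 0xFFFF - 0x0000 + 1                # U+0000..U+FFFF
SURROGATE_CODE_POINTS = 0xDFFF - 0xD800 + 1          # U+D800..U+DFFF
SUPPLEMENTARY_CODE_POINTS = 0x10FFFF - 0x10000 + 1   # U+10000..U+10FFFF

# Code points that UTF-16 represents with a single 16-bit code unit.
single_unit = BMP_CODE_POINTS - SURROGATE_CODE_POINTS

print(BMP_CODE_POINTS)                    # 64K
print(SURROGATE_CODE_POINTS)              # 2K
print(single_unit, single_unit // 1024)   # count, and the same count in "K"
print(SUPPLEMENTARY_CODE_POINTS)          # code points needing a surrogate pair
```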

--Ken



On 7/24/2023 1:56 PM, Markus Scherer wrote:

On Mon, Jul 24, 2023 at 1:49 PM Addison Phillips <addisoni18n@gmail.com> wrote:

Term: UTF-16

Location: https://www.unicode.org/faq/utf_bom.html#utf16-1

Current definition: UTF-16 <https://www.unicode.org/glossary/#UTF_16> uses a single 16-bit code unit <https://www.unicode.org/glossary/#code_unit> to encode the most common 63K characters, and a pair of 16-bit code units, called surrogates, to encode the 1M less commonly used characters in Unicode.

 

This definition seems to have a typo in it (it should probably be 64K), although for clarity it should perhaps say 65,525. The “1M less commonly used characters” is also misleading, as not all of these characters are “less commonly used” any more and the number 1M is really close to but not exactly the number of encoded code points for supplementary characters.

 

I suspect it might have wanted to be 62k characters, subtracting the 2k surrogate code points. And 65,525 would be off by 11 :-)

Also, these are numbers of code points, not assigned characters, and there are a few thousand PUA code points, etc.

I would remove the numbers and say something like "encode the most commonly used characters" / "... emoji and less commonly used characters, with plenty of room to add future characters".
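As an illustration of the single-unit versus surrogate-pair split that the FAQ entry is describing, here is a rough Python sketch of the UTF-16 encoding form. The function name `to_utf16_code_units` is made up for this example; in practice one would simply call `str.encode("utf-16-be")`.

```python
def to_utf16_code_units(cp: int) -> list[int]:
    """Return the UTF-16 code unit(s) for one code point (illustrative helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points are not encodable")
    if cp < 0x10000:
        # BMP (non-surrogate) code point: one 16-bit code unit.
        return [cp]
    # Supplementary code point: split the 20-bit offset into two 10-bit halves.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trail) surrogate
    return [high, low]


print([hex(u) for u in to_utf16_code_units(0x0041)])    # one code unit
print([hex(u) for u in to_utf16_code_units(0x1F600)])   # a surrogate pair
```

The second call uses an emoji code point, which fits Markus's point that supplementary characters are not necessarily "less commonly used".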

 

markus

-- 
Edcom mail list: https://groups.google.com/a/unicode.org/g/edcom

Received on Monday, 24 July 2023 22:41:33 UTC