Questions/comments about glossary and FAQ entries (i18n-actions#25) from Addison Phillips on 2023-07-24 (public-i18n-core@w3.org from July to September 2023)

From: Addison Phillips <addisoni18n@gmail.com>
Date: Mon, 24 Jul 2023 13:49:10 -0700
To: "'Editorial Committee'" <edcom@unicode.org>
Cc: "'Internationalization Working Group'" <public-i18n-core@w3.org>
Message-ID: <03c501d9be70$4ca9a0a0$e5fce1e0$@gmail.com>

Hello Edcom,

 

The W3C Internationalization WG actioned me [1] to let you know of some
issues we found with the Unicode glossary and FAQs while revising some
entries in _our_ glossary [3]. Where possible, we like to quote the Unicode
glossary verbatim rather than invent our own definitions.

 

Before writing this note, I looked for an appropriate repo in which to file
issues against the glossary, but didn't find one. I'd be glad of a pointer
(both to file these comments in a suitably structured way and for any
future issues).

 

The issues we found were:

 

Term: Kana

Location: https://unicode.org/glossary/#kana

Current Definition: The name of a primarily syllabic script used by the
Japanese writing system. It comes in two forms, hiragana and katakana. The
former is used to write particles, grammatical affixes, and words that have
no kanji form; the latter is used primarily to write foreign words.

 

We found this definition potentially confusing. Several members of our
group think it would be clearer to say that "Kana" is a collective term for
the two syllabic scripts used (along with kanji and romaji) by the Japanese
writing system. Also, the usage of katakana is not limited to words of
foreign origin, and perhaps some wording could be added to indicate this.

 

Term: UTF-16

Location: https://www.unicode.org/faq/utf_bom.html#utf16-1

Current definition: UTF-16 uses a single 16-bit code unit to encode the most
common 63K characters, and a pair of 16-bit code units, called surrogates,
to encode the 1M less commonly used characters in Unicode.

 

This definition seems to have a typo in it (it should probably be 64K),
although for clarity it should perhaps say 65,536. The "1M less commonly
used characters" is also misleading: not all of these characters are "less
commonly used" any more, and 1M is close to, but not exactly, the number of
code points for supplementary characters.
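
For reference, the code point arithmetic behind these figures can be
sketched in Python. This is only an illustration: the constant and function
names are our own and do not come from the Unicode FAQ or glossary, and the
counts are of code points, not of assigned characters.

```python
# Illustrative names (our own assumptions, not from the Unicode FAQ).
BMP_CODE_POINTS = 0xFFFF - 0x0000 + 1                # U+0000..U+FFFF
SURROGATE_CODE_POINTS = 0xDFFF - 0xD800 + 1          # U+D800..U+DFFF
SUPPLEMENTARY_CODE_POINTS = 0x10FFFF - 0x10000 + 1   # U+10000..U+10FFFF

# BMP code points that UTF-16 encodes as one code unit (surrogates excluded).
SINGLE_UNIT_CODE_POINTS = BMP_CODE_POINTS - SURROGATE_CODE_POINTS


def utf16_code_units(ch: str) -> int:
    """Return the number of 16-bit code units UTF-16 uses for one character."""
    # "utf-16-le" writes no byte order mark, so every 2 bytes is one code unit.
    return len(ch.encode("utf-16-le")) // 2


print(BMP_CODE_POINTS)                 # 65536
print(SINGLE_UNIT_CODE_POINTS)         # 63488
print(SUPPLEMENTARY_CODE_POINTS)       # 1048576
print(utf16_code_units("A"))           # 1 (U+0041, in the BMP)
print(utf16_code_units("\U0001F600"))  # 2 (U+1F600, a surrogate pair)
```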

 

Could you please have a look at these issues and let me know how best to
proceed or if you have any questions?

 

Thanks!

 

Addison (for W3C I18N)

 

 

[1] https://github.com/w3c/i18n-actions/issues/25

[2] https://www.w3.org/2023/07/20-i18n-minutes.html#t06

[3] https://www.w3.org/TR/i18n-glossary 

 

Addison Phillips

Chair (W3C Internationalization WG)

 

Internationalization is not a feature.

It is an architecture.

Received on Monday, 24 July 2023 20:49:16 UTC