- From: Addison Phillips <addisoni18n@gmail.com>
- Date: Mon, 24 Jul 2023 13:49:10 -0700
- To: "'Editorial Committee'" <edcom@unicode.org>
- Cc: "'Internationalization Working Group'" <public-i18n-core@w3.org>
- Message-ID: <03c501d9be70$4ca9a0a0$e5fce1e0$@gmail.com>
Hello Edcom,

I was actioned [1] by the W3C Internationalization WG to let you know of some issues we found with the Unicode glossary and FAQs while revising some entries in _our_ glossary [3]. Where possible, we like to quote the Unicode glossary verbatim rather than inventing our own definitions.

Before writing this note, I looked for an appropriate repo in which to file issues against the glossary, but I didn't find one. I'd be glad of a pointer (both to file these comments in a suitably structured way and for any future issues).

The issues we found were:

Term: Kana
Location: https://unicode.org/glossary/#kana
Current Definition: The name of a primarily syllabic script used by the Japanese writing system. It comes in two forms, hiragana <https://unicode.org/glossary/#hiragana> and katakana <https://unicode.org/glossary/#katakana>. The former is used to write particles, grammatical affixes, and words that have no kanji <https://unicode.org/glossary/#kanji> form; the latter is used primarily to write foreign words.

We found this definition potentially confusing. Several of our group think it would be clearer to say that "Kana" is a collective term for the two syllabic scripts used (along with kanji and romaji) by the Japanese writing system. Also, the use of katakana is not limited to words of foreign origin, and some wording could be added to indicate this.

Term: UTF-16
Location: https://www.unicode.org/faq/utf_bom.html#utf16-1
Current definition: UTF-16 <https://www.unicode.org/glossary/#UTF_16> uses a single 16-bit code unit <https://www.unicode.org/glossary/#code_unit> to encode the most common 63K characters, and a pair of 16-bit code units, called surrogates, to encode the 1M less commonly used characters in Unicode.

This definition seems to have a typo in it (it should probably be 64K), although for clarity it should perhaps say 65,536.
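
For reference, the figures in the quoted definition can be checked with a little arithmetic. The following is only an illustrative Python sketch, not text from the Unicode glossary or FAQ; the constant and function names are made up for this note, and the ranges are the code point ranges defined by the Unicode Standard.

```python
# Code point ranges defined by the Unicode Standard.
# The names below are hypothetical, chosen only for this illustration.
BMP_CODE_POINTS = 0x10000                            # U+0000..U+FFFF
SURROGATE_CODE_POINTS = 0xDFFF - 0xD800 + 1          # U+D800..U+DFFF
SUPPLEMENTARY_CODE_POINTS = 0x10FFFF - 0x10000 + 1   # U+10000..U+10FFFF

# Code points that UTF-16 can represent with a single 16-bit code unit:
# every BMP code point except the surrogates.
SINGLE_UNIT_CODE_POINTS = BMP_CODE_POINTS - SURROGATE_CODE_POINTS


def utf16_code_units(ch: str) -> int:
    """Return how many 16-bit code units UTF-16 needs for one character."""
    return len(ch.encode("utf-16-le")) // 2


print(BMP_CODE_POINTS)             # 65536
print(SURROGATE_CODE_POINTS)       # 2048
print(SINGLE_UNIT_CODE_POINTS)     # 63488
print(SUPPLEMENTARY_CODE_POINTS)   # 1048576
print(utf16_code_units("A"), utf16_code_units("\U0001F600"))  # 1 2
```

Whichever figure the definition settles on, stating the exact count alongside the rounded one would avoid ambiguity about what "K" and "M" mean here.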

The "1M less commonly used characters" is also misleading, as not all of these characters are "less commonly used" any more, and the number 1M is really close to, but not exactly, the number of encoded code points for supplementary characters.

Could you please have a look at these issues and let me know how best to proceed or if you have any questions?

Thanks!

Addison (for W3C I18N)

[1] https://github.com/w3c/i18n-actions/issues/25
[2] https://www.w3.org/2023/07/20-i18n-minutes.html#t06
[3] https://www.w3.org/TR/i18n-glossary

Addison Phillips
Chair (W3C Internationalization WG)

Internationalization is not a feature.
It is an architecture.
Received on Monday, 24 July 2023 20:49:16 UTC