Last Call: <draft-bray-unichars-10.txt> (Unicode Character Repertoire Subsets): W3C I18N Review from Addison Phillips on 2025-02-10 (public-i18n-core@w3.org from January to March 2025)

From: Addison Phillips <addisoni18n@gmail.com>
Date: Mon, 10 Feb 2025 13:08:49 -0800
To: last-call@ietf.org
Message-ID: <a72776c0-b9d1-4808-9407-4f8d5cf780fe@gmail.com>

All,

The W3C Internationalization Working Group (of which I am chair) was
requested to review several IETF documents nearing or in IETF Last Call.

This email lists the issues our working group noticed in our review of:

https://datatracker.ietf.org/doc/draft-bray-unichars/

I have some concerns about the purpose of this I-D. There are a lot of
documents in various standards bodies trying to address similar issues.
I think harmonization of these types of documents is strongly desirable.

---

The specific issues our group identified are tracked in GitHub here:

https://github.com/w3c/i18n-activity/issues?q=is%3Aissue%20state%3Aopen%20label%3As%3Aunichars

---

Here are the comments:

#1980: Quibbles about characters and code points

https://datatracker.ietf.org/doc/draft-bray-unichars/

There are 1,114,112 code points; as of Unicode 15.1 (2023), fewer
than 150,000 have been assigned to characters. It is difficult to
specify that unassigned code points should be avoided because they
regularly become assigned when new characters are added to Unicode.

Section 2 of the I-D provides a description of characters and code
points for local use in the document. The above quoted paragraph might
be improved by:

* mention the hex upper bound of the code point space (0x10FFFF, or
0x10FFFD if you prefer) next to or instead of the weird decimal number.
* the phrase "It is difficult to specify that unassigned code points
should be avoided" understates the problem. We explicitly do not
want to forbid unassigned code points that later do become assigned.
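
To make the relationship between the decimal count and the hex bound concrete, here is a minimal Python sketch (the variable names are mine, not the I-D's). It assumes the usual definition of the code point space as U+0000 through U+10FFFF inclusive; U+10FFFE and U+10FFFF are noncharacters, which is why some people prefer to cite 0x10FFFD.

```python
import sys

# Code points run from U+0000 through U+10FFFF inclusive.
MAX_CODE_POINT = 0x10FFFF
code_space_size = MAX_CODE_POINT + 1

print(hex(code_space_size))      # 0x110000
print(f"{code_space_size:,}")    # 1,114,112

# CPython exposes the same upper bound as sys.maxunicode.
print(sys.maxunicode == MAX_CODE_POINT)  # True
```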

---

#1981: "Transformation Formats" might be clearer as "character encoding"?

Unicode describes a variety of "transformation formats", ways to
marshal code points into byte sequences. A survey of transformation
formats is beyond the scope of this document. However, it is useful
to note that the "UTF-16" format represents each code point with one
or two 16-bit chunks, and the "UTF-8" format uses variable-length
byte sequences.

Section 2.1 is labelled "Transformation Formats" and uses that term
instead of the more familiar "character encoding" or "character encoding
form". It is the case that "UTF" stands for "Unicode Transformation
Format" and is part of the name of Unicode's character encodings, but
that seems like a good footnote rather than something to be used in general.
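
For readers less familiar with the two encoding forms named in the quoted paragraph, a small Python sketch may help. It only counts units using Python's built-in codecs; the helper name `unit_counts` is hypothetical.

```python
def unit_counts(ch):
    """Return (UTF-8 bytes, UTF-16 code units) for a one-code-point string."""
    utf8_bytes = len(ch.encode("utf-8"))
    # "utf-16-le" avoids the byte order mark; each code unit is 2 bytes.
    utf16_units = len(ch.encode("utf-16-le")) // 2
    return utf8_bytes, utf16_units

samples = [
    "A",           # U+0041  -> (1, 1)
    "\u00e9",      # U+00E9  -> (2, 1)
    "\u20ac",      # U+20AC  -> (3, 1)
    "\U0001F600",  # U+1F600 -> (4, 2), a surrogate pair in UTF-16
]
for ch in samples:
    print(f"U+{ord(ch):04X}", unit_counts(ch))
```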

---

#1982: C1 controls, Unicode line endings

Section 2.2.2 introduces control codes and talks specifically about the
C0 controls. The C1 controls are mentioned /en passant/ in section 3:

The value of the "example" field contains the C0 control NUL, the C1
control "CHARACTER TABULATION WITH JUSTIFICATION", an unpaired...

... but not elsewhere. Possibly the C1 controls should be dealt with in
2.2.2?

Also, the poorly supported U+2028 (LINE SEPARATOR) and U+2029 (PARAGRAPH
SEPARATOR) line endings aren't mentioned.
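
As an illustration of why these code points deserve a mention, here is a short Python sketch. The helper `is_c1_control` is hypothetical and simply tests the C1 range U+0080 through U+009F. The `splitlines` behaviour shown is Python's own; other platforms treat these characters differently, which is rather the point.

```python
import unicodedata

def is_c1_control(ch):
    """True if ch lies in the C1 control range U+0080..U+009F."""
    return 0x80 <= ord(ch) <= 0x9F

# U+0089 is CHARACTER TABULATION WITH JUSTIFICATION, the C1 control
# used in the I-D's Section 3 example.
htj = "\u0089"
print(is_c1_control(htj), unicodedata.category(htj))   # True Cc

# U+2028 and U+2029 are not controls; they have their own categories.
print(unicodedata.category("\u2028"))  # Zl
print(unicodedata.category("\u2029"))  # Zp

# Python's str.splitlines() treats them, and the C1 control NEL
# (U+0085), as line boundaries.
print("a\u2028b\u2029c\u0085d".splitlines())  # ['a', 'b', 'c', 'd']
```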

---

#1983: Replacement character examples

replacing problematic code points, ideally with "�" (U+FFFD,
REPLACEMENT CHARACTER), although some popular software platforms,
notably Java, use "?".

This is probably incorrect. Java replaces with U+FFFD in most Unicode
processing (including decoding from legacy encodings). (Encoding /to/
legacy encodings in Java uses "?".) There are other places, such as
certain browsers, where "?" is used in a Unicode context.

Note that there exist common coders that use the control character
U+001A (SUB) as a replacement character for some legacy encodings.

---

#1984: Security consideration statement perhaps too bold?

Note that the Unicode-character subsets specified in this document
include a successively-decreasing number of problematic code points,
and thus should be less and less susceptible to vulnerabilities. The
Section 4.3 subset, "Unicode Assignables", excludes all of them.

Saying that the Section 4.3 subset excludes "all of them" suggests that
no exploits remain. The preceding paragraph mentions that RFC 8264's
security considerations also apply here, and that document is fairly
thorough. Since homographs
<https://www.w3.org/TR/charmod-norm/#normalizationLimitations> cannot be
eliminated, maybe this should say something slightly different? Perhaps:

Note that the Unicode-character subsets specified in this document
successively exclude an increasing number of problematic code points,
and thus should be less and less susceptible to many of these exploits.
The Section 4.3 subset, "Unicode Assignables", excludes all of the
functionally problematic code points.

I should mention, however, that UTS #55 should probably be mentioned or
at least considered. "Trojan Source" attacks using bidi formatting
characters can affect protocol text and document formats, and this
looks like a gap that needs mentioning. Homographs and confusables are
probably worth a couple of words as well.
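
To show why excluding code points by category does not close these gaps, here is a rough Python sketch. The `BIDI_FORMATTING` set is my own illustrative selection (the embedding, override, and isolate controls), not a list taken from UTS #55, and `find_bidi_formatting` is a hypothetical helper.

```python
import unicodedata

# Illustrative set only: U+202A..U+202E (embeddings and overrides)
# and U+2066..U+2069 (isolates).
BIDI_FORMATTING = set(range(0x202A, 0x202F)) | set(range(0x2066, 0x206A))

def find_bidi_formatting(text):
    """Return (index, 'U+XXXX') for each bidi formatting character."""
    return [(i, f"U+{ord(c):04X}")
            for i, c in enumerate(text) if ord(c) in BIDI_FORMATTING]

# Homographs survive normalization: Cyrillic U+0430 looks like Latin
# "a" but stays a distinct code point under NFKC.
latin = "example"
mixed = "ex\u0430mple"
same = unicodedata.normalize("NFKC", latin) == unicodedata.normalize("NFKC", mixed)
print(same)   # False

# These are all assigned, non-control code points, so a subset defined
# by excluding controls and noncharacters would still admit them.
print(find_bidi_formatting("x = 1 \u202e# comment"))   # [(6, 'U+202E')]
```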

=== end of comments ===

I'll note that this is the first W3C I18N review of an IETF last call
(in recent memory). We track our issues in GitHub, which is heavily
tailored to the W3C Process and tools. I am aware that this is
incompatible with IETF's processes and tooling. I apologize in advance
for any inconvenience that my providing comments might cause and invite
feedback on how we can do better.

Also, for visibility, I have blind-copied (to avoid cross-posting issues)
this message to our public list
(https://lists.w3.org/Archives/Public/public-i18n-core/).

Regards (for W3C I18N),

Addison

--
Addison Phillips
Chair (W3C Internationalization WG)

Internationalization is not a feature.
It is an architecture.

Received on Monday, 10 February 2025 21:08:56 UTC