[Last-Call]: <draft-bormann-dispatch-modern-network-unicode-05> (Modern Network Unicode): W3C I18N Review from Addison Phillips on 2025-02-10 (public-i18n-core@w3.org from January to March 2025)

From: Addison Phillips <addisoni18n@gmail.com>
Date: Mon, 10 Feb 2025 14:27:30 -0800
To: last-call@ietf.org
Message-ID: <caf2e949-b97b-4c2f-8032-5ea4e08cd0d6@gmail.com>

All,

The W3C Internationalization Working Group (of which I am chair) was
requested to review several IETF documents nearing or in IETF Last Call.

This email represents the issues our working group noticed in our review of:

https://datatracker.ietf.org/doc/draft-bormann-dispatch-modern-network-unicode/

---

The specific issues our group identified are tracked in GitHub here:

https://github.com/w3c/i18n-activity/issues?q=is%3Aissue%20state%3Aopen%20label%3As%3Amodern-network-unicode

---

Here are the comments:

#1972: Exclude other non-characters

Section 2, point number 4:
https://datatracker.ietf.org/doc/draft-bormann-dispatch-modern-network-unicode/

4. The code points U+FFFE and U+FFFF MUST NOT be used. Also, Byte
Order Marks (leading U+FEFF characters) MUST NOT be used.

This should probably also exclude the non-character code points at the
end of each supplementary plane (e.g. U+1FFFE, U+2FFFF, U+10FFFE, etc.)
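
For illustration only, here is a minimal Python sketch of a check that
covers the whole noncharacter set. The function name is hypothetical and
nothing here is taken from the draft. Besides the last two code points of
each plane, it also treats U+FDD0..U+FDEF as noncharacters, which Unicode
designates as such even though the comment above does not list them.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 Unicode noncharacters."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # U+FDD0..U+FDEF: a contiguous run of 32 noncharacters in the BMP.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of each of the 17 planes (U+nFFFE, U+nFFFF).
    return (cp & 0xFFFE) == 0xFFFE


# U+FEFF (the BOM character) is an assigned character, not a noncharacter,
# so a rule for it has to stay separate from this check.
for cp in (0xFFFE, 0x1FFFE, 0x2FFFF, 0x10FFFE, 0xFDD0, 0xFEFF):
    print(f"U+{cp:04X}", is_noncharacter(cp))
```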

---

#1973: Relationship to CRLF line endings

https://datatracker.ietf.org/doc/draft-bormann-dispatch-modern-network-unicode/

Section 3 disallows CR in "2D MNU" (line-based Unicode text). Section 5
allows specs to define variants that include CR and CRLF line endings.
Disallowing CRLF rather than supporting it adaptively seems like it
would create a lot of uncertainty.
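
As a sketch of what "supporting it adaptively" could mean on the
receiving side (one possible reading, not something the draft
specifies), a consumer might fold CRLF into LF while still rejecting a
bare CR. The function name is hypothetical.

```python
def fold_crlf(text: str) -> str:
    """Map each CRLF pair to LF; reject any CR not followed by LF."""
    folded = text.replace("\r\n", "\n")
    if "\r" in folded:
        # Any CR still present was not part of a CRLF pair.
        raise ValueError("bare CR is not allowed")
    return folded


print(repr(fold_crlf("line one\r\nline two\n")))
```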

---

#1974: "With NFKC" variant considered harmful

Section 5.7 defines a "With NFKC" variant.

This is probably a Bad Idea.

NFKC is destructive and also might be incomplete in accomplishing
something useful. Mentioning the K forms is probably fine, but by not
defining this, one could stay away from the problems it produces. Note
that W3C has this note in charmod-norm <https://www.w3.org/TR/charmod-norm>:

Unicode compatibility decomposition removes meaning from the text
that it is applied to. That means that this normalization step
produces the most promiscuous matches. Some developers and
specification authors find this level of normalization attractive
because it appears to bring together many strings that are logically
similar, but this level of normalization has limited utility in
actual practice and has side effects that confuse users. This
normalization step is presented for completeness, but it is not
generally appropriate for use on the Web.
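
To make the "destructive" point concrete, here is a small Python sketch
using the standard library's unicodedata module. The sample strings are
my own choice for illustration. NFC leaves each of them untouched, while
NFKC rewrites them and the original distinction cannot be recovered
afterwards.

```python
import unicodedata

# Hypothetical sample strings chosen for illustration.
samples = {
    "superscript": "x\u00b2",    # x followed by SUPERSCRIPT TWO
    "ligature": "\ufb01le",      # LATIN SMALL LIGATURE FI + "le"
    "fullwidth": "\uff21\uff22", # FULLWIDTH LATIN CAPITAL LETTER A, B
    "circled": "\u2460",         # CIRCLED DIGIT ONE
}

nfc = {k: unicodedata.normalize("NFC", v) for k, v in samples.items()}
nfkc = {k: unicodedata.normalize("NFKC", v) for k, v in samples.items()}

for name, original in samples.items():
    print(name, repr(original), repr(nfc[name]), repr(nfkc[name]))
```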

---

#1975: Link and create harmony between this doc and W3C document
"charmod-norm"

W3C has a document whose short name (for historical reasons) is
"charmod-norm" and whose title is "Character Model for the World Wide
Web: String Matching". See: https://www.w3.org/TR/charmod-norm/. These
documents have some similarity of content (there is also a similarity to
PRECIS). It might be a good idea to cross-link this document and
charmod-norm and ensure consistency when there is overlap.

---

#1976: Missing 'character encoding form'?

The Appendix A definition of terminology is pretty good, but doesn't
mention character encoding [form], which is the mapping from the code
points in a character set to code units. This is actually the more
commonly needed term.
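
As an illustration of the term, the following Python sketch maps one
code point to its code units under the three Unicode encoding forms.
The helper name and the choice of U+1F600 are mine, not the draft's.

```python
def code_units(ch: str, form: str) -> list[int]:
    """Return the code units of a single character in the given encoding form."""
    width = {"utf-8": 1, "utf-16": 2, "utf-32": 4}[form]
    # Use big-endian codecs so that no byte order mark is emitted.
    codec = form if width == 1 else form + "-be"
    data = ch.encode(codec)
    return [int.from_bytes(data[i:i + width], "big")
            for i in range(0, len(data), width)]


ch = chr(0x1F600)
for form in ("utf-8", "utf-16", "utf-32"):
    print(form, [hex(u) for u in code_units(ch, form)])
```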

Note too the opportunity to harmonize with I18N Glossary
<https://w3c.github.io/i18n-glossary/#c>

---

#1977: Missing discussion of surrogates?

There is some discussion of surrogates in the appendices, but no
mention of them in the body of the document, especially near the ABNF.
It's probably a good idea to at least mention their exclusion somewhere
in Section 6.
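
For illustration, a minimal Python sketch of the exclusion (the function
name is hypothetical, and this is not the draft's ABNF). Surrogate code
points U+D800..U+DFFF are not Unicode scalar values, so well-formed
UTF-8 cannot carry them.

```python
def has_surrogate(text: str) -> bool:
    """Return True if text contains any code point in U+D800..U+DFFF."""
    return any(0xD800 <= ord(c) <= 0xDFFF for c in text)


lone = "a\ud800"          # a Python str can hold a lone surrogate
astral = "a\U0001F600"    # a supplementary character is a single code point

print(has_surrogate(lone), has_surrogate(astral))

# Python's strict UTF-8 encoder refuses the lone surrogate.
try:
    lone.encode("utf-8")
except UnicodeEncodeError as exc:
    print("rejected:", exc.reason)
```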

---

#1978: Quirks in the history?

There are a variety of places where one could take issue with the
"history of Unicode" in Appendix B. I don't see any technical issues and
don't really want to suggest any alterations, since this version of
history conveys all of the important technical details and leaves out or
alters some things that probably only matter to historians. I am filing
this issue to note that we didn't ignore it.

---

#1979: NFC and specifications

Appendix C discusses Unicode normalization and the NFC form. The focus
is on implementations, but there probably should be a mention of
specifications (that is, I-Ds and other IETF technical documents) here
(as with charmod-norm). It is primarily name/value matching that is
affected by potential non-normalization. Specifications need to require
(or forbid!) it in matching/uniqueness algorithms without requiring
implementations to do Early Uniform Normalization on the wire.
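
A sketch of the distinction, in Python and with hypothetical function
names: the stored or transmitted strings are left exactly as received,
and normalization is applied only inside the comparison that a
specification defines. A specification that forbids normalization in
matching would use the code-point comparison instead.

```python
import unicodedata


def match_codepoints(a: str, b: str) -> bool:
    """Matching with normalization forbidden: compare code point by code point."""
    return a == b


def match_nfc(a: str, b: str) -> bool:
    """Matching with NFC required: normalize both operands at comparison time."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "caf\u00e9"   # e with acute as one code point
decomposed = "cafe\u0301"   # e followed by COMBINING ACUTE ACCENT

print(match_codepoints(precomposed, decomposed))
print(match_nfc(precomposed, decomposed))
```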

---

Thanks!

Best regards (for W3C I18N),

Addison

--
Addison Phillips
Chair (W3C Internationalization WG)

Internationalization is not a feature.
It is an architecture.

Received on Monday, 10 February 2025 22:27:37 UTC