- From: Carsten Bormann <cabo@tzi.org>
- Date: Thu, 1 Feb 2024 09:02:30 +0100
- To: Mark Nottingham <mnot@mnot.net>
- Cc: Francesca Palombini <francesca.palombini@ericsson.com>, "draft-ietf-httpbis-sfbis@ietf.org" <draft-ietf-httpbis-sfbis@ietf.org>, HTTP Working Group <ietf-http-wg@w3.org>
On 2024-02-01, at 08:05, Mark Nottingham <mnot@mnot.net> wrote:
>
> Hi Carsten!
>
>> On 30 Jan 2024, at 2:38 am, Carsten Bormann <cabo@tzi.org> wrote:
>>
>> On 2024-01-29, at 15:41, Francesca Palombini <francesca.palombini@ericsson.com> wrote:
>>>
>>> What parts of [I-D.draft-bray-unichars] is the reader supposed to look at? Or if it is the whole document, could we have some context around it?
>>
>> It seems that sfbis refers to Unicode codepoints where it should have referred to Unicode scalar values (what are said to be codepoints now, need to allow encoding in UTF-8, which only applies to Unicode scalar values).
>
> People seem to have strong and conflicting beliefs about the correct terminology here -- others have asserted the opposite in my recollection.
>
> So I'm afraid that before I'm willing to change the spec (again) I need see a reference supporting any assertions, and agreement on its interpretation.

Hi Mark,

as sfbis is based on UTF-8, your main reference should be STD63, specifically RFC3629 [1].
Obviously, UTF-8 is based on Unicode standardization work, so the other reference is [2].

[1]: https://www.rfc-editor.org/rfc/rfc3629.html
[2]: https://www.unicode.org/versions/Unicode15.0.0/UnicodeStandard-15.0.pdf

Unicode terminology is sometimes confusing, and it doesn’t help that at the time RFC 3629 was written, there wasn’t a term defined for what the Unicode consortium now clumsily calls “Unicode scalar values”: the set of Unicode characters that Unicode encoding forms (nee Unicode transformation formats) such as UTF-8 can encode.

See this definition (page 119 of [2]):

   D76 Unicode scalar value: Any Unicode code point except high-surrogate and low-surrogate code points.
   As a result of this definition, the set of Unicode scalar values consists of the ranges 0 to D7FF(16) and E000(16) to 10FFFF(16), inclusive.

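As a rough illustration of D76 (a sketch, not text from either specification), the two ranges can be written as a range check in Python. The helper names `is_code_point`, `is_scalar_value` and `utf8_encodable` are hypothetical and chosen only for this example. The last helper relies on the fact that Python's strict UTF-8 codec refuses surrogate code points.

```python
# Hypothetical helper names; the numeric ranges are taken from D76.
SURROGATE_FIRST = 0xD800   # first high-surrogate code point
SURROGATE_LAST = 0xDFFF    # last low-surrogate code point
MAX_CODE_POINT = 0x10FFFF  # upper end of the Unicode code space


def is_code_point(n: int) -> bool:
    """True for any Unicode code point, surrogates included."""
    return 0 <= n <= MAX_CODE_POINT


def is_scalar_value(n: int) -> bool:
    """True for code points in 0..D7FF or E000..10FFFF (D76)."""
    return is_code_point(n) and not (SURROGATE_FIRST <= n <= SURROGATE_LAST)


def utf8_encodable(n: int) -> bool:
    """True if Python's strict UTF-8 codec accepts code point n."""
    try:
        chr(n).encode("utf-8")
    except ValueError:
        # chr() raises ValueError outside 0..10FFFF; encoding a surrogate
        # raises UnicodeEncodeError, which is a subclass of ValueError.
        return False
    return True
```

Under these assumptions, `is_scalar_value(n)` and `utf8_encodable(n)` agree at every boundary of the two ranges, and a noncharacter such as U+FFFE still counts as a scalar value.
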
When Unicode created the term “Unicode scalar values”, they thought that they could not use the more natural wording “Unicode characters” because Unicode scalar values include some code values that are called “non-characters” (*) in Unicode… “Unicode characters” is still what most people would understand and what I therefore tend to use in informal conversation.

The term “Unicode code points” encompasses the Unicode scalar values as well as some code points that are used inside UTF-16 only.
Before “Unicode scalar values” was defined, “Unicode code points” was often used in its place because it is the encompassing concept, and it often still is, either because “Unicode scalar values” is so clumsy or simply because documentation is created by copying from old sources.

This all seems pretty obvious, until you encounter the problem that a number of platforms are living on a legacy character model that was created as a transition strategy from the original pure 16-bit Unicode they adopted early on.
Applications that work in this space tend to leak out UTF-16 internals, causing a lot of pain [3].
For interchange, we could (and should) ignore that, except that there are people who are convinced that we should share that pain.

On page 5 (middle of Section 3, [4]), RFC 3629 [1] specifically calls out that the Unicode code points that are not Unicode scalar values (today’s words) cannot be encoded in UTF-8.

To minimize the confusion (and to reduce the number of hooks that the pain-sharers can use to muddy the issue), a standard like yours should try to avoid the generalism “Unicode code points” and talk about “Unicode scalar values” throughout, possibly after copying D76.

[3]: https://www.ietf.org/archive/id/draft-bormann-dispatch-modern-network-unicode-03.html#name-history-legacy
[4]: https://www.rfc-editor.org/rfc/rfc3629.html#page-5

Grüße, Carsten

(*) There is lots of structure in the range covered by Unicode scalar values.
A specification that is not intricately bound to those details, but really mostly wants to encode Unicode, is best off simply using the stable term “Unicode scalar values” in its explanations and ignoring those details, which are evolving.

Received on Thursday, 1 February 2024 08:02:41 UTC