Re: AD Review of draft-ietf-httpbis-sfbis-05

Thanks, Martin. I'm happy to incorporate that approach if others don't have objections.

Cheers,


> On 2 Feb 2024, at 11:38 am, Martin J. Dürst <duerst@it.aoyama.ac.jp> wrote:
> 
> Hello Mark, others,
> 
> On 2024-02-01 17:02, Carsten Bormann wrote:
>> On 2024-02-01, at 08:05, Mark Nottingham <mnot@mnot.net> wrote:
>>> 
>>> Hi Carsten!
>>> 
>>>> On 30 Jan 2024, at 2:38 am, Carsten Bormann <cabo@tzi.org> wrote:
>>>> 
>>>> On 2024-01-29, at 15:41, Francesca Palombini <francesca.palombini@ericsson.com> wrote:
>>>>> 
>>>>> What parts of [I-D.draft-bray-unichars] is the reader supposed to look at? Or if it is the whole document, could we have some context around it?
>>>> 
>>>> It seems that sfbis refers to Unicode codepoints where it should have referred to Unicode scalar values (what it now calls codepoints need to be encodable in UTF-8, and UTF-8 can only encode Unicode scalar values).
>>> 
>>> People seem to have strong and conflicting beliefs about the correct terminology here -- others have asserted the opposite in my recollection.
> 
> First, the strictly correct answer is that it doesn't matter; both terms would lead to the same result (assuming people read the specs). The reason why is that the spec says, in 4.1.11. Serializing a Display String (https://datatracker.ietf.org/doc/html/draft-ietf-httpbis-sfbis-05#name-serializing-a-display-strin), point 2:
> 
> Let byte_array be the result of applying UTF-8 encoding (Section 3 of [UTF8]) to input_sequence. If encoding fails, fail serialization.
> 
> [UTF8] (RFC 3629) then says:
> 
>   The definition of UTF-8 prohibits encoding character numbers between
>   U+D800 and U+DFFF, which are reserved for use with the UTF-16
>   encoding form (as surrogate pairs) and do not directly represent
>   characters.  When encoding in UTF-8 from UTF-16 data, it is necessary
>   to first decode the UTF-16 data to obtain character numbers, which
>   are then encoded in UTF-8 as described above.
> 
> The net result of this is that if there are any non-Unicode scalar value codepoints, serialization will just fail.
> 
> However, not assuming that people will read the specs (always a safe bet), I'd suggest adding a short note, maybe as follows:
> 
> Please note that [UTF8] prohibits the encoding of codepoints between U+D800 and U+DFFF (surrogates).
> 
> [short aside: It took me a while to figure out that section 3.3.8, entitled "Display Strings", didn't actually specify display strings, but was just a quick intro to display strings. To help future readers, I'd at a minimum change "3. Structured Data Types" to "3. Overview of Structured Data Types" or some such. Also, a pointer to later sections at the start of section 3 would be appreciated.]
> 
> Hope this helps,    Martin.
> 
> 
>>> So I'm afraid that before I'm willing to change the spec (again) I need to see a reference supporting any assertions, and agreement on its interpretation.
>> Hi Mark,
>> as sfbis is based on UTF-8, your main reference should be STD 63, specifically RFC 3629 [1].
>> Obviously, UTF-8 is based on Unicode standardization work, so the other reference is [2].
>> [1]: https://www.rfc-editor.org/rfc/rfc3629.html
>> [2]: https://www.unicode.org/versions/Unicode15.0.0/UnicodeStandard-15.0.pdf
>> Unicode terminology is sometimes confusing, and it doesn’t help that at the time RFC 3629 was written, there wasn’t a term defined for what the Unicode consortium now clumsily calls “Unicode scalar values”: the set of Unicode characters that Unicode encoding forms (née Unicode transformation formats) such as UTF-8 can encode.  See this definition (page 119 of [2]):
>> D76
>>   Unicode scalar value: Any Unicode code point except high-surrogate and low-surrogate code points.
>>   As a result of this definition, the set of Unicode scalar values consists of the ranges 0 to D7FF(16) and E000(16) to 10FFFF(16), inclusive.
>> When Unicode created the term “Unicode scalar values”, they thought that they could not use the more natural wording “Unicode characters” because Unicode scalar values include some code values that are called “non-characters” (*) in Unicode…  “Unicode characters” is still what most people would understand and what I therefore tend to use in informal conversation.
>> The term “Unicode code points” encompasses the Unicode scalar values as well as some code points that are used inside UTF-16 only.  Before “Unicode scalar values” was defined, “Unicode code points” was often used in its place because it is the encompassing concept, and it is still often used, either because “Unicode scalar values” is so clumsy or simply because documentation is created by copying from old sources.
>> This all seems pretty obvious, until you encounter the problem that a number of platforms are living on a legacy character model that was created as a transition strategy from the original pure 16-bit Unicode they adopted early on.  Applications that work in this space tend to leak out UTF-16 internals, causing a lot of pain [3].  For interchange, we could (and should) ignore that, except that there are people who are convinced that we should share that pain.
>> RFC 3629 [1] specifically calls out, on page 5 (middle of Section 3, [4]), that the Unicode code points that are not Unicode scalar values (today’s words) cannot be encoded in UTF-8.
>> To minimize the confusion (and to reduce the number of hooks that the pain-sharers can use to muddy the issue) a standard like yours should try to avoid the generalism “Unicode code points” and talk about “Unicode scalar values” throughout, possibly after copying D76.
>> [3]: https://www.ietf.org/archive/id/draft-bormann-dispatch-modern-network-unicode-03.html#name-history-legacy
>> [4]: https://www.rfc-editor.org/rfc/rfc3629.html#page-5
>> Grüße, Carsten
>> (*) There is lots of structure in the range covered by Unicode scalar values.  A specification that is not intricately bound to those details, but really mostly wants to encode Unicode, is best off simply using the stable term “Unicode scalar values” in its explanations and ignoring those details, which are evolving.
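
To make the point in the quoted thread concrete, here is a minimal Python sketch of the two things discussed above: the D76 range check for Unicode scalar values, and the fact that UTF-8 encoding fails on surrogate code points. The function names (is_unicode_scalar_value, encode_display_string) are hypothetical and are not taken from sfbis or from any library. The sketch relies on Python's str type being able to hold lone surrogates and on its built-in "utf-8" codec rejecting them by default, which is only one way an implementation might see the failure described in RFC 3629.

```python
def is_unicode_scalar_value(code_point: int) -> bool:
    """D76: any Unicode code point except the surrogates U+D800..U+DFFF.

    Hypothetical helper name, used for illustration only.
    """
    in_code_space = 0 <= code_point <= 0x10FFFF
    is_surrogate = 0xD800 <= code_point <= 0xDFFF
    return in_code_space and not is_surrogate


def encode_display_string(input_sequence: str) -> bytes:
    """Apply UTF-8 encoding; fail if the input cannot be encoded.

    Hypothetical helper that mirrors only the "apply UTF-8 encoding,
    fail serialization if encoding fails" step quoted above, not the
    rest of the serialization algorithm.
    """
    try:
        # Python's strict UTF-8 codec refuses lone surrogates.
        return input_sequence.encode("utf-8")
    except UnicodeEncodeError as exc:
        raise ValueError("serialization failed: not encodable as UTF-8") from exc


if __name__ == "__main__":
    for cp in (0xD7FF, 0xD800, 0xDFFF, 0xE000, 0x10FFFF):
        print(f"U+{cp:04X} scalar value: {is_unicode_scalar_value(cp)}")
    print(encode_display_string("Dürst"))
    try:
        encode_display_string("\ud800")
    except ValueError as err:
        print(err)
```

Either wording in the spec leads to the same observable behaviour in this sketch: a sequence containing a code point that is not a Unicode scalar value never makes it past the encoding step.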


--
Mark Nottingham   https://www.mnot.net/

Received on Sunday, 11 February 2024 23:51:55 UTC