[I18N-ACTION-586] Summary of WhatWG Infra string


I was tasked by the WG with summarizing the status of WhatWG's work on defining terms related to string and byte types. This email summarizes my observations.

The spec in question lives here:


WhatWG has the following issues of interest:

Definition of string
Note: This is the primary discussion.

Using generics for bytes / code units / code points
Note: I find this one problematic as it wants to mix bytes, code points, and characters together when defining the term 'ASCII digit'

Define string sorting by "code unit order"
Note: I commented on this one. Code unit order for sorting strings is only faintly reasonable if one is using UTF-16 code units. UTF-8 code units (bytes) produce a not-very-useful order. UTF-16 isn't ideal either, since supplementary characters are encoded with surrogate code units and so end up in a blob after 0xD7FF and before 0xE000. Code point order is at least Unicode order. Note also that this isn't really collation so much as it is a way of producing a deterministic order.

byte sequence backtick representation handling of C0 controls
Note: this is my comment; it's a minor issue
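
To illustrate the sorting note above, here is a minimal Python sketch comparing code point order with UTF-16 code unit order. The helper name utf16_code_units and the sample characters are my own choices for illustration; nothing here comes from the Infra spec itself.

```python
def utf16_code_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

# U+D7FF and U+E000 bracket the surrogate range; U+1F600 is supplementary.
samples = ["\uFF5E", "\U0001F600", "\uE000", "\uD7FF"]

# Python compares str values code point by code point.
by_code_point = sorted(samples)

# Sorting on the UTF-16 code units places the supplementary character
# (encoded as 0xD83D 0xDE00) between U+D7FF and U+E000.
by_utf16_code_unit = sorted(samples, key=utf16_code_units)

for label, ordering in (("code point", by_code_point),
                        ("UTF-16 code unit", by_utf16_code_unit)):
    print(label, [f"U+{ord(c):04X}" for c in ordering])
```

Both orders are deterministic, but only the first matches Unicode code point order.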

The main problem that I have with the Infra spec is that it defines only the minimum needed to write the other WhatWG specs. The results are not technically incorrect, as long as one knows what one is doing and what the specification intends. My main concern is that the terms byte / byte sequence and code point / string are treated interchangeably whenever it is valid and convenient to do so, and that this could promote confusion among those unfamiliar with the mechanics of UTF-8 and Unicode.

For example, the definition of 'byte sequence' contains this example:

> Headers, such as `Content-Type`, are byte sequences.

While HTTP, on the wire, is in fact a sequence of bytes, and so saying that a header is a "byte sequence" is technically correct, I think most folks think of headers as strings whose serialization happens to be (ASCII) byte sequences. Treating them as strings allows users to ignore details of encoding until they become important and saves having separate byte-sequence-specific and string-specific functions. UTF-8's design allows for and takes advantage of this.

This can be seen in the fact that the spec defines functions for uppercasing and lowercasing a byte sequence:

> To byte-lowercase a byte sequence<https://infra.spec.whatwg.org/#byte-sequence>, increase each byte<https://infra.spec.whatwg.org/#byte> it contains, in the range 0x41 to 0x5A, inclusive, by 0x20.

> To byte-uppercase a byte sequence<https://infra.spec.whatwg.org/#byte-sequence>, subtract each byte<https://infra.spec.whatwg.org/#byte> it contains, in the range 0x61 to 0x7A, inclusive, by 0x20.

It does so even though the "string" section later in the document provides string operations that do the same thing:

> To ASCII lowercase a string<https://infra.spec.whatwg.org/#string>, replace all ASCII upper alphas<https://infra.spec.whatwg.org/#ascii-upper-alpha> in the string<https://infra.spec.whatwg.org/#string> with their corresponding code point<https://infra.spec.whatwg.org/#code-point> in ASCII lower alpha<https://infra.spec.whatwg.org/#ascii-lower-alpha>.

> To ASCII uppercase a string<https://infra.spec.whatwg.org/#string>, replace all ASCII lower alphas<https://infra.spec.whatwg.org/#ascii-lower-alpha> in the string<https://infra.spec.whatwg.org/#string> with their corresponding code point<https://infra.spec.whatwg.org/#code-point> in ASCII upper alpha<https://infra.spec.whatwg.org/#ascii-upper-alpha>.

Byte-casing is only appropriate when a byte sequence happens to contain text, that is, when it is a string in disguise. This happens often enough in Internet protocols, but it seems wasteful and potentially confusing to have separate definitions. Issue #17 above also discusses mixing the usage.
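
As a rough illustration of the overlap, the two lowercase operations quoted above can be sketched in Python. The function names byte_lowercase and ascii_lowercase are mine, and the sample values are made up for the example.

```python
def byte_lowercase(data: bytes) -> bytes:
    """Add 0x20 to every byte in the range 0x41 to 0x5A, inclusive."""
    return bytes(b + 0x20 if 0x41 <= b <= 0x5A else b for b in data)

def ascii_lowercase(text: str) -> str:
    """Replace each of A-Z with the corresponding code point in a-z."""
    return "".join(chr(ord(c) + 0x20) if "A" <= c <= "Z" else c for c in text)

header = "Content-Type"
mixed = "Caf\u00C9-TYPE"  # contains U+00C9, which neither operation touches

# For ASCII and UTF-8 serializations, the two operations agree.
ascii_agrees = byte_lowercase(header.encode("ascii")) == ascii_lowercase(header).encode("ascii")
utf8_agrees = byte_lowercase(mixed.encode("utf-8")) == ascii_lowercase(mixed).encode("utf-8")

# For bytes that are not ASCII-compatible text, byte-lowercasing corrupts data:
# U+0141 in UTF-16LE is the bytes 0x41 0x01, and 0x41 gets "lowercased".
corrupted = byte_lowercase("\u0141".encode("utf-16-le")).decode("utf-16-le")
```

The agreement in the UTF-8 case relies on every byte of a multi-byte UTF-8 sequence being 0x80 or higher, so the range 0x41 to 0x5A can only ever match ASCII letters.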

The other potential issue is the use of "code point" to mean what is more generally referred to as a character: integers from 0x0 to 0x10FFFF inclusive. This is not directly stated. It has to be inferred, since later there is a definition of "scalar value" that excludes the surrogate range (but keeps noncharacters such as U+FFFF and U+1FFFE around). Here's the definition:

> A code point is a Unicode code point and is represented as a four-to-six digit hexadecimal number, typically prefixed with "U+". Often the name of the code point<https://infra.spec.whatwg.org/#code-point> is also included in capital letters afterward, potentially with the rendered form of the code point<https://infra.spec.whatwg.org/#code-point> in parentheses. [UNICODE]<https://infra.spec.whatwg.org/#biblio-unicode>

And here's the definition of string:

> A string is a sequence of code points<https://infra.spec.whatwg.org/#code-point>.
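
The distinction between code points and scalar values can be made concrete with a short Python sketch. The predicate names are my own, and the ranges are the ones given in the Unicode Standard; Python's str type is used only because it happens to be a sequence of code points that may include lone surrogates.

```python
def is_code_point(n: int) -> bool:
    """Any integer from 0x0 to 0x10FFFF, inclusive."""
    return 0x0 <= n <= 0x10FFFF

def is_surrogate(n: int) -> bool:
    """Code points reserved for UTF-16 surrogate code units."""
    return 0xD800 <= n <= 0xDFFF

def is_scalar_value(n: int) -> bool:
    """A code point that is not a surrogate."""
    return is_code_point(n) and not is_surrogate(n)

def is_noncharacter(n: int) -> bool:
    """U+FDD0..U+FDEF plus the last two code points of each plane."""
    return is_code_point(n) and (0xFDD0 <= n <= 0xFDEF or (n & 0xFFFE) == 0xFFFE)

def can_encode_utf8(s: str) -> bool:
    """True if every code point in s is a scalar value, so UTF-8 can encode it."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

lone_surrogate = chr(0xD800)   # a valid code point, but not a scalar value
noncharacter = chr(0xFFFF)     # a noncharacter, but still a scalar value
```

A string holding lone_surrogate is a perfectly good sequence of code points, yet it has no UTF-8 serialization, which is one place where the code point / scalar value distinction becomes visible.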

I would propose:

1. Recommend that tighter definitions, free of unstated assumptions, be adopted. In particular, it should be clear that strings are sequences of code points (but not scalar values) and that a correctly formed surrogate pair is not a pair of code points.

2. Keep byte / byte sequence and code point / string separate, and spell out the connection between them for cases where it is convenient to treat byte sequences as strings.


Received on Monday, 20 February 2017 18:42:28 UTC