[whatwg/webidl] “Unicode character” should likely say “Unicode scalar value” in intro to lexical grammar (Issue #1080)

The lexical grammar is introduced with:

> The tokenizer operates on a sequence of Unicode characters [[UNICODE]](https://www.unicode.org/versions/Unicode14.0.0/).

The Unicode Standard does not define “Unicode characters” as far as I can tell, so that leaves “characters,” which it does define\*, but the definitions (plural) don’t seem compatible with how the term is used in this context.

It seems this should probably say Unicode scalar values (USVs). This can sorta be inferred because the [string literal interpretation algorithm](https://webidl.spec.whatwg.org/#string-literal) appears to assume that source text consumed by `string` is already known to be exclusively USVs. I don’t know the ins-and-outs of Perl 5.5.8 regular expressions\*\*, but it seems like the grammar given for `string`, `/"[^"]*"/`, likely doesn’t preclude lone surrogates in itself, which implies that the “operates on a sequence...” statement was meant to establish USVs-only as a prior fact about the input.
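To illustrate the concern in an engine whose behavior is easier to pin down, here is a minimal sketch using Python’s `re` module. This is an analogy only: Python is not Perl, and it says nothing definitive about how Perl treats the same pattern. The helper names (`matches_string_token`, `is_scalar_value_string`) are made up for this example. Python strings are sequences of code points and may contain lone surrogates, so the sketch shows that a negated class like `[^"]` accepts one unless the input has been restricted to scalar values beforehand.

```python
import re

# The `string` token pattern from the grammar, transcribed into Python's
# regex syntax. Assumption: Python's `re` stands in for the Perl engine
# here purely for illustration.
STRING_TOKEN = re.compile(r'"[^"]*"')


def matches_string_token(text: str) -> bool:
    """Return True if the whole text matches the `string` token pattern."""
    return STRING_TOKEN.fullmatch(text) is not None


def is_scalar_value_string(text: str) -> bool:
    """Return True if no code point in text is a surrogate (U+D800..U+DFFF)."""
    return not any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


# A quoted string whose only content is the lone surrogate U+D800.
lone_surrogate_literal = '"\ud800"'

# The pattern accepts the input even though it is not a sequence of USVs.
print(matches_string_token(lone_surrogate_literal))    # True
print(is_scalar_value_string(lone_surrogate_literal))  # False
```

In this engine, at least, the pattern alone does not enforce USVs-only input, so that constraint would have to come from the statement about what the tokenizer operates on.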

---

\* It goes into considerable detail in the “Characters, not glyphs” section of [§ 2.2 Unicode Design Principles](https://www.unicode.org/versions/Unicode14.0.0/ch02.pdf) and provides a [glossary entry for “character”](http://www.unicode.org/glossary/).

\*\* Gave up on figuring this out because IIUC perl’s strings have observable encodings and character sets that impact how its regular expressions get interpreted — so it may be the case that there isn’t a single answer to the question “does `[^"]` match non-USV code points”.

Message ID: <whatwg/webidl/issues/1080@github.com>

Received on Saturday, 25 December 2021 05:28:45 UTC