[whatwg] Valid Unicode

From: Elliotte Harold <elharo@metalab.unc.edu>
Date: Fri, 01 Dec 2006 07:38:45 -0500
Message-ID: <45702255.9090009@metalab.unc.edu>
In 9.1.3 we see:

Text must consist of valid Unicode characters other than U+0000. Text 
should not contain control characters other than space characters.

Later in the spec we find:

If the number is not a valid Unicode character (e.g. if the number is 
higher than 1114111), or if the number is zero, then return a character 
token for the U+FFFD REPLACEMENT CHARACTER character instead.
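To make the quoted rule concrete, here is a minimal Python sketch of it, read literally. The function name char_for_reference is hypothetical and not taken from the spec; the only checks applied are the two the quoted text actually names (zero, and numbers above 1114111).

```python
REPLACEMENT = "\ufffd"      # U+FFFD REPLACEMENT CHARACTER
MAX_CODE_POINT = 0x10FFFF   # 1114111, the highest Unicode code point


def char_for_reference(number: int) -> str:
    """Map a parsed numeric character reference to a character.

    Hypothetical helper: it applies only the two conditions the quoted
    text spells out, and passes everything else through unchanged.
    """
    if number == 0 or number > MAX_CODE_POINT:
        return REPLACEMENT
    return chr(number)
```

Under this literal reading, a number such as 0xD800 (a surrogate) or 0xFFFE (a noncharacter) is passed through unchanged, because the quoted text does not say whether either counts as "not a valid Unicode character".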

I do not think the Unicode spec defines the notion of a "valid Unicode 
character". (It does define a valid Unicode code unit sequence, but 
that's a little different. A code unit sequence generally consists of 
more than one character.) Thus I suggest we need to be more precise here 
about what is and is not a valid Unicode character. In particular:

1. Are private use characters allowed?
2. Are control characters allowed? (Probably yes, based on other parts of 
the spec.)
3. Are surrogate characters allowed? (Probably no.)
4. Are numbers beyond 10FFFF allowed? (No.)
5. Are reserved but currently unassigned characters allowed? (Yes.)
6. Are the noncharacters U+FDD0..U+FDEF allowed? (?)
7. Are the noncharacters that occupy the last two code points of each 
plane allowed? (?)
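
The categories in these questions can be told apart mechanically. The following Python sketch is only an illustration: the function name classify and the labels it returns are my own, and the results for assigned versus unassigned code points depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata


def classify(cp: int) -> str:
    """Return a rough label for the integer code point cp.

    Hypothetical helper; the labels mirror the questions in the list,
    not any terminology the spec itself defines.
    """
    if cp < 0 or cp > 0x10FFFF:
        return "out of range"
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    # Noncharacters: U+FDD0..U+FDEF, plus the last two code points
    # (xxFFFE and xxFFFF) of each of the 17 planes.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    category = unicodedata.category(chr(cp))
    if category == "Co":
        return "private use"
    if category == "Cc":
        return "control"
    if category == "Cn":
        return "unassigned"
    return "assigned"
```

For example, classify(0xE000) gives "private use" and classify(0x1FFFF) gives "noncharacter". A more precise definition of "valid Unicode character" would amount to saying which of these labels are acceptable.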

-- 
Elliotte Rusty Harold  elharo at metalab.unc.edu
Java I/O 2nd Edition Just Published!
Received on Friday, 1 December 2006 04:38:45 UTC
