- From: Elliotte Harold <elharo@metalab.unc.edu>
- Date: Fri, 01 Dec 2006 07:38:45 -0500
In 9.1.3 we see:

    Text must consist of valid Unicode characters other than U+0000. Text should not contain control characters other than space characters.

Later in 9.2.3.1 we find:

    If the number is not a valid Unicode character (e.g. if the number is higher than 1114111), or if the number is zero, then return a character token for the U+FFFD REPLACEMENT CHARACTER character instead.

I do not think the Unicode spec defines the notion of a "valid Unicode character". (It does define a valid Unicode code unit sequence, but that's a little different. A code unit sequence generally consists of more than one character.) Thus I suggest we need to be more precise here about what is and is not a valid Unicode character. In particular:

1. Are private use characters allowed?
2. Are control characters allowed? (probably yes, based on other parts of the spec)
3. Are surrogate characters allowed? (probably no)
4. Are non-characters beyond 10FFFF allowed? (no)
5. Are reserved but currently undefined characters allowed? (yes)
6. Are noncharacters U+FDD0..U+FDEF allowed? (?)
7. Are the noncharacters from the last two characters of each plane allowed? (?)

-- 
Elliotte Rusty Harold  elharo at metalab.unc.edu
Java I/O 2nd Edition Just Published!
http://www.cafeaulait.org/books/javaio2/
http://www.amazon.com/exec/obidos/ISBN=0596527500/ref=nosim/cafeaulaitA/
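To make the seven questions above concrete, here is a rough Python sketch that sorts an integer code point into the categories the list distinguishes. The function name `classify` and the category labels are hypothetical, invented for illustration; nothing here comes from the spec under discussion. The "control", "private-use" and "unassigned" answers rely on Python's `unicodedata` module, so they reflect whatever Unicode version the interpreter ships with. Which of these categories should count as "valid" is exactly the open question, so the sketch only labels and does not decide.

```python
import unicodedata

MAX_CODE_POINT = 0x10FFFF  # 1114111, the upper limit quoted from 9.2.3.1


def classify(cp: int) -> str:
    """Label a code point with one of the categories from the list (hypothetical helper)."""
    if cp < 0 or cp > MAX_CODE_POINT:
        return "out-of-range"          # question 4: beyond 10FFFF
    if cp == 0:
        return "null"                  # U+0000, excluded by 9.1.3
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"             # question 3
    if 0xFDD0 <= cp <= 0xFDEF:
        return "noncharacter"          # question 6
    if (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"          # question 7: last two code points of each plane
    category = unicodedata.category(chr(cp))
    if category == "Co":
        return "private-use"           # question 1
    if category == "Cc":
        return "control"               # question 2
    if category == "Cn":
        return "unassigned"            # question 5: reserved but currently undefined
    return "assigned"


if __name__ == "__main__":
    for cp in (0x0, 0x1F, 0x41, 0xD800, 0xE000, 0xFDD0, 0xFFFF, 0x10FFFE, 0x110000):
        print(f"{cp:#08x} {classify(cp)}")
```

The noncharacter checks come before the `unicodedata` lookup because noncharacters are reported with general category Cn and would otherwise be indistinguishable from merely unassigned code points.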
Received on Friday, 1 December 2006 04:38:45 UTC