- From: Ian Hickson <ian@hixie.ch>
- Date: Wed, 16 Sep 2009 09:40:42 +0000 (UTC)
On Tue, 15 Sep 2009, Øistein E. Andersen wrote:
> >
> > I suppose we could just change the spec and say that surrogate
> > characters (whether literal characters, e.g. in UTF-8, or from
> > character references) all get converted to U+FFFD?.
>
> That seems to be the only reasonable option if handling ��
> as U+FFFD U+FFFD is deemed desirable and sufficiently compatible with
> existing documents. It would simplify things a bit in non-UTF-16
> environments (as compared to my interpretation of the current text)
> without much added complexity in UTF-16 environments.

Ok, done.

> > The spec says "Bytes or sequences of bytes in the original byte stream
> > that could not be converted to Unicode characters must be converted to
> > U+FFFD REPLACEMENT CHARACTER code points".
>
> I take it you mean that \xD800� should turn into \xFFFD� at this
> point, which is only supported by the quoted text if "bytes or sequences of
> bytes" representing surrogates "[cannot] be converted to Unicode characters"
> or, to put it differently, if surrogates are not "Unicode characters".

Correct. Surrogates aren't Unicode characters.

> Unfortunately for this reading, the term "Unicode character" does not
> seem to be defined in HTML5 or in Unicode,

I've added a definition to HTML5. The proper Unicode term is "Unicode
scalar value", apparently.

> and the following paragraph (which appears shortly after the one you
> quoted) clearly includes surrogate code points within the concept of
> "Unicode character":
>
> "Any occurrences of any characters in the ranges [...] U+D800 to U+DFFF,
> [...] are parse errors. (These are all control characters or permanently
> undefined Unicode characters.)"
>
> Moreover, this paragraph would be pointless if the characters mentioned
> therein could never occur at all.

I've changed the text to refer to "code points" when it talks about
surrogate code points.

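
[Editorial illustration, not part of the original message.] The behaviour agreed on above, where every surrogate code point ends up as U+FFFD, can be sketched in Python, whose `str` type is able to hold lone surrogates. The names `is_surrogate` and `replace_surrogates` are hypothetical helpers made up for this sketch; this is not the spec's parsing algorithm, only the one rule discussed in this thread.

```python
REPLACEMENT_CHARACTER = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def is_surrogate(code_point: int) -> bool:
    """Return True for code points in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= code_point <= 0xDFFF


def replace_surrogates(text: str) -> str:
    """Map each surrogate code point in `text` to U+FFFD.

    Every other code point is passed through unchanged, so the result
    contains Unicode scalar values only.
    """
    return "".join(
        REPLACEMENT_CHARACTER if is_surrogate(ord(ch)) else ch
        for ch in text
    )
```

Under this sketch, a high surrogate followed by a low surrogate is not combined into one supplementary character; each code point is replaced separately, giving U+FFFD U+FFFD as in the outcome mentioned above.
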
> The use of "Unicode character" without a definition is fine in other
> parts of HTML5, but clearly not sufficiently precise in this instance.
> If you want to exclude (unpaired) surrogate code points only, the
> appropriate term to use would probably be "Unicode scalar value".

Yeah. Fixed.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Wednesday, 16 September 2009 02:40:42 UTC