[whatwg] Surrogate pairs and character references from Øistein E. Andersen on 2009-09-17 (public-whatwg-archive@w3.org from September 2009)

From: Øistein E. Andersen <liszt@coq.no>
Date: Thu, 17 Sep 2009 01:38:05 +0100
Message-ID: <698DACBB-E338-49D0-8BF2-6B796C6FED32@coq.no>

It is much clearer now.  Thanks.  Just a few minor issues:

> "Bytes or sequences of bytes in the original byte stream that could  
> not be converted to Unicode characters must be converted to U+FFFD  
> REPLACEMENT CHARACTER code points."

With the new definition of Unicode characters as Unicode scalar  
values, this excludes surrogate code points, which are also handled  
separately (and cause a parse error) in the step quoted below.  You  
may want to say "Unicode code points" rather than "Unicode characters".
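
To make the distinction concrete, here is a minimal Python sketch. The helper names `is_scalar_value` and `decode_lossy` are hypothetical, and Python's built-in decoder is only a stand-in for the decoding step, not the algorithm the spec describes. It shows that surrogates are code points but not scalar values, and that undecodable bytes come out as U+FFFD:

```python
SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF
MAX_CODE_POINT = 0x10FFFF


def is_scalar_value(cp):
    """Return True if cp is a Unicode scalar value, i.e. a code point
    that is not a surrogate."""
    if not 0 <= cp <= MAX_CODE_POINT:
        return False
    return not (SURROGATE_FIRST <= cp <= SURROGATE_LAST)


def decode_lossy(data, encoding="utf-8"):
    """Decode bytes, substituting U+FFFD for anything that cannot be
    converted (assumption: Python's errors="replace" behaviour stands
    in for the conversion step quoted above)."""
    return data.decode(encoding, errors="replace")


print(is_scalar_value(0xD800))          # surrogate: a code point, not a scalar value
print(is_scalar_value(0xFFFD))          # the replacement character itself
print(ascii(decode_lossy(b"a\xffb")))   # the invalid byte 0xFF becomes U+FFFD
```

With "Unicode code points" in the quoted sentence, surrogates would fall outside the "could not be converted" case and be left to the later replacement step.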

"U+FFFD REPLACEMENT CHARACTERs" is sufficient, is used elsewhere, and  
probably reads better than "U+FFFD REPLACEMENT CHARACTER code points".

> All U+0000 NULL characters and code points in the range U+D800 to  
> U+DFFF in the input must be replaced by U+FFFD REPLACEMENT  
> CHARACTERs. Any occurrences of such characters and code points are  
> parse errors.

The phrase "characters and code points" (in the second sentence) is  
awkward given that all characters are in fact code points.
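
For illustration, the quoted step could be sketched in Python roughly as follows. The function name `preprocess_input` and the idea of returning a parse-error count are my own assumptions, not anything taken from the spec text:

```python
REPLACEMENT_CHARACTER = "\uFFFD"


def preprocess_input(text):
    """Replace U+0000 and surrogate code points (U+D800 to U+DFFF)
    with U+FFFD, counting each occurrence as one parse error.

    Returns a (cleaned_text, parse_error_count) tuple. Reporting
    errors as a count is an assumption made for this sketch.
    """
    cleaned = []
    parse_errors = 0
    for ch in text:
        cp = ord(ch)
        if cp == 0x0000 or 0xD800 <= cp <= 0xDFFF:
            cleaned.append(REPLACEMENT_CHARACTER)
            parse_errors += 1
        else:
            cleaned.append(ch)
    return "".join(cleaned), parse_errors


cleaned, errors = preprocess_input("a\x00b\ud800c")
print(ascii(cleaned), errors)
```

A single test on the code point value covers both cases, which is why "characters and code points" need not be spelled out twice.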

-- 
Øistein E. Andersen

Received on Wednesday, 16 September 2009 17:38:05 UTC