
[whatwg] Parsing Numeric Character References

From: Lachlan Hunt <lachlan.hunt@lachy.id.au>
Date: Sun, 12 Mar 2006 15:22:35 +1100
Message-ID: <4413A20B.6020408@lachy.id.au>
Hi,
   Section 8.2.1, Tokenising Entities, states the following for a numeric 
character reference:

| If one or more characters match the range, then take them all and
| interpret the string of characters as a number (either hexadecimal
| or decimal as appropriate), and return a character token for the
| Unicode character whose codepoint is that number. If the number is
| not a valid Unicode character (e.g. if the number is higher than
| 1114111), or if the number is zero, then return a character token for
| the U+FFFD REPLACEMENT CHARACTER character instead.
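
As a rough illustration only, the quoted rule could be sketched in Python as
follows. The function name and parameters are hypothetical, and the sketch
assumes the tokeniser has already matched the digits against the allowed
range. It handles only the two cases the quoted text names explicitly (zero
and numbers above 1114111).

```python
REPLACEMENT = "\ufffd"       # U+FFFD REPLACEMENT CHARACTER
MAX_CODE_POINT = 0x10FFFF    # 1114111, the highest Unicode code point


def decode_numeric_reference(digits: str, hexadecimal: bool) -> str:
    """Return the character for an already-matched run of digits.

    Hypothetical helper: 'digits' is assumed to contain only characters
    from the matched range (0-9, plus a-f/A-F when hexadecimal is True).
    """
    number = int(digits, 16 if hexadecimal else 10)
    # The two cases the quoted text calls out explicitly.
    if number == 0 or number > MAX_CODE_POINT:
        return REPLACEMENT
    return chr(number)


print(decode_numeric_reference("41", hexadecimal=True))
print(decode_numeric_reference("0", hexadecimal=False) == REPLACEMENT)
```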

This does not cover the characters in the range from #x80 to #x9F, which 
have historically been treated as code points from the Windows-1252 
repertoire, rather than the control characters from Unicode.  AFAIK, 
this is already interoperably implemented in all browsers.
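
A minimal sketch of that historical behaviour, assuming Python's built-in
"cp1252" codec is an acceptable stand-in for the Windows-1252 repertoire.
The function name is hypothetical. Python's codec leaves a few bytes in this
range undefined; passing those through unchanged is an assumption of the
sketch, not something I have verified against the browsers.

```python
def remap_c1_reference(number: int) -> str:
    """Map a numeric reference in #x80-#x9F through Windows-1252.

    Hypothetical helper; numbers outside the range are returned as the
    ordinary Unicode character.
    """
    if 0x80 <= number <= 0x9F:
        try:
            # Treat the number as a single Windows-1252 byte.
            return bytes([number]).decode("cp1252")
        except UnicodeDecodeError:
            # Assumption: bytes the codec leaves undefined pass through.
            return chr(number)
    return chr(number)


# &#x80; becomes the euro sign, &#x99; the trade mark sign.
print(remap_c1_reference(0x80) == "\u20ac")
print(remap_c1_reference(0x99) == "\u2122")
```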

Characters in the range from #x01 to #x19 (except for whitespace 
characters) are not treated interoperably across platforms.  On Windows, 
Firefox, IE and Opera all displayed characters from some repertoire I 
couldn't identify.  On Mac, however, all the browsers displayed either 
nothing or a box (a placeholder character).  I think these should all 
return U+FFFD.

The use of characters in either of these ranges should be an easy parse 
error.
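
To make the proposal concrete, here is a hedged sketch of how the two ranges
might be handled together. All names are hypothetical. The whitespace set
(tab, LF, FF, CR) is my assumption, and the upper bound of the control range
is a parameter that defaults to #x19 as written above (note that the Unicode
C0 control block itself extends to #x1F).

```python
from typing import NamedTuple

REPLACEMENT = "\ufffd"                  # U+FFFD REPLACEMENT CHARACTER
WHITESPACE = {0x09, 0x0A, 0x0C, 0x0D}   # assumption: tab, LF, FF, CR


class Decoded(NamedTuple):
    char: str
    parse_error: bool


def decode_with_proposal(number: int, control_last: int = 0x19) -> Decoded:
    """Apply the handling proposed in this message to one reference."""
    # Non-whitespace control characters: return U+FFFD and flag an error.
    if 0x01 <= number <= control_last and number not in WHITESPACE:
        return Decoded(REPLACEMENT, True)
    # #x80-#x9F: keep the Windows-1252 mapping, but flag an error.
    if 0x80 <= number <= 0x9F:
        try:
            return Decoded(bytes([number]).decode("cp1252"), True)
        except UnicodeDecodeError:
            # Assumption: bytes undefined in the codec pass through.
            return Decoded(chr(number), True)
    return Decoded(chr(number), False)


print(decode_with_proposal(0x01).parse_error)
print(decode_with_proposal(0x09).parse_error)
```

Zero and out-of-range numbers are left to the existing rule quoted earlier
and are not repeated here.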

-- 
Lachlan Hunt
http://lachy.id.au/
Received on Saturday, 11 March 2006 20:22:35 UTC
