- From: Řistein E. Andersen <html5@xn--istein-9xa.com>
- Date: Tue, 07 Nov 2006 00:28:16 +0100
On 5 Nov 2006, at 1:7PM, Lachlan Hunt wrote: > At the very least, ISO-8859-1 must be treated as Windows-1252. I'm not sure > about the other ISO-8859 encodings. Numeric and hex character references from > 128 to 159 must also be treated as Windows-1252 code points. I think this actually implies that the C1 control characters (range 128--159 in ISO-8859-1 and Unicode) will have to be excluded from conforming HTML5 documents, as there will be no sensible way to include them in documents (with a possible exception for the five codes in this range that are not defined in CP1252, but those should definitely be changed to U+FFFD during parsing). This should not be a problem, though, since these characters were formally banned already in HTML 2.0 (possibly earlier, I am not sure). The characters VT and FF have also been forbidden at least since HTML 2.0, as well as the other C0 control characters except \t, \r, and \n, and also delete (127). Some of the other ISO 8859 encodings also are subsets of Windows Codepages, but these are probably less problematic since CP1250, CP1251, CP1253, ..., CP1258 obtained their IANA registrations much earlier than CP1252 and also because of the preferential treatment that ISO 8859-1 has been given as default in HTML. To summarise, the sensible thing to do would probably be to disallow the following Unicode ranges: C0 (0-31) except \t (9), \n (10), and \r (13) (not allowed in either HTML or XML) Delete (127) (not allowed in HTML) C1 (128--159) (not allowed in HTML) Surrogate characters (U+D800--U+DFFF) (not allowed in either HTML or XML) Explicitly undefined characters (U+FFFE--U+FFFF) (not allowed in XML) If any of these characters occur (directly or via numerical character reference) in a document, they should probably be replaced with the replacement character U+FFFD, with the following exceptions: VT and FF should be transformed into a regular space (if necessary) Characters in the range 128--159 that are also defined in CP1252 should be transformed according to their definition in this encoding (Note that in a document in Mac OS Roman [as an example of a completely different 8-bit encoding], the entity &128; actually is an occurrence of character number 128 in Unicode/ISO-8859-1 and should be transformed into a euro sign under these rules, whereas character number 128 encoded directly [as an 8-bit character] correctly encodes a capital A with umlaut/di??resis and must be left alone.) Would this be too simplistic? -- ??istein E. Andersen
Received on Monday, 6 November 2006 15:28:16 UTC