
[whatwg] Valid Unicode

From: Sam Ruby <rubys@intertwingly.net>
Date: Sat, 2 Dec 2006 11:24:36 -0500
Message-ID: <3d4032300612020824k34d5cf89of1c598eb4b65165a@mail.gmail.com>
On 12/1/06, Elliotte Harold <elharo at metalab.unc.edu> wrote:
> Henri Sivonen wrote:
>
> >> 6. Are noncharacters U+FDD0..U+FDEF allowed (?)
> >> 7. Are the noncharacters from the last two characters of each plane
> >> allowed (?)
> >
> > I don't have particularly strong feelings here. Putting those characters
> > in HTML is a bad idea, but allowing them is not a problem for HTML5 to
> > XHTML5 conversion and they aren't a common problem like C1 controls.
>
> FFFE and FFFF are specifically forbidden by XML so they should probably
> be forbidden here too. I think the others are allowed.

Unicode (not XML) reserves U+D800–U+DFFF as well as U+FFFE and U+FFFF.

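XML 1.0 only allows the following characters:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

and discourages the following: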

[#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF],
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
[#x10FFFE-#x10FFFF].
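
As a rough illustration, the check can be sketched in Python. This assumes the XML 1.0 Char production (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]) and treats the ranges listed above as the ones the spec discourages. The function and constant names are made up for this example and do not come from any library.

```python
# Sketch only: ranges transcribed from the XML 1.0 Char production and its
# note on discouraged characters. All names here are hypothetical.
XML10_CHAR_RANGES = [
    (0x9, 0xA), (0xD, 0xD), (0x20, 0xD7FF),
    (0xE000, 0xFFFD), (0x10000, 0x10FFFF),
]

# Discouraged ranges as listed in this message: some controls, the
# U+FDD0 block, and the last two code points of planes 1 through 16.
XML10_DISCOURAGED_RANGES = [(0x7F, 0x84), (0x86, 0x9F), (0xFDD0, 0xFDDF)] + [
    ((plane << 16) | 0xFFFE, (plane << 16) | 0xFFFF) for plane in range(1, 17)
]


def _in_ranges(cp, ranges):
    """Return True if code point cp falls inside any (low, high) pair."""
    return any(low <= cp <= high for low, high in ranges)


def is_xml10_char(cp):
    """True if cp may appear at all in an XML 1.0 document."""
    return _in_ranges(cp, XML10_CHAR_RANGES)


def is_xml10_discouraged(cp):
    """True if cp is allowed by XML 1.0 but falls in a discouraged range."""
    return is_xml10_char(cp) and _in_ranges(cp, XML10_DISCOURAGED_RANGES)
```

Under these assumptions, is_xml10_char(0x0C) is False (form feed is not an XML character), is_xml10_char(0xFFFE) is False, and is_xml10_discouraged(0x1FFFE) is True.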

It would not be wise for HTML5 to limit itself to the more constrained
character set of XML.  In particular, the form feed character is
pretty popular.

This is yet another case where "take HTML5, read it into a DOM, and
serialize it as XML, and voilà: you have valid XHTML" doesn't work.
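
A minimal sketch of that failure, using Python's standard xml.dom.minidom as a stand-in for "a DOM" (an assumption for illustration; no HTML5 parser is involved, and the helper name roundtrips_as_xml is invented here). The DOM accepts a form feed in a text node and serializes it unchanged, but the result is rejected when parsed as XML.

```python
from xml.dom.minidom import getDOMImplementation, parseString
from xml.parsers.expat import ExpatError


def roundtrips_as_xml(text):
    """Put text in a one-element DOM, serialize it, and try to re-parse it.

    Returns True if the serialized form is accepted by the XML parser,
    False if the parser rejects it as not well-formed.
    """
    doc = getDOMImplementation().createDocument(None, "p", None)
    doc.documentElement.appendChild(doc.createTextNode(text))
    serialized = doc.toxml()  # minidom does not check for non-XML characters
    try:
        parseString(serialized)
    except ExpatError:
        return False
    return True


print(roundtrips_as_xml("page one, page two"))      # plain text survives
print(roundtrips_as_xml("page one\x0cpage two"))    # form feed does not
```

A serializer that wants to guarantee well-formed output has to check each character itself, and then either reject the document or replace the offending characters.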

> --
> Elliotte Rusty Harold  elharo at metalab.unc.edu
> Java I/O 2nd Edition Just Published!
> http://www.cafeaulait.org/books/javaio2/
> http://www.amazon.com/exec/obidos/ISBN=0596527500/ref=nosim/cafeaulaitA/

- Sam Ruby
Received on Saturday, 2 December 2006 08:24:36 UTC
