
[whatwg] Valid Unicode

From: Sam Ruby <rubys@intertwingly.net>
Date: Sat, 2 Dec 2006 20:47:15 -0500
Message-ID: <3d4032300612021747s3f6b5b73x978991a6f2a0ac9@mail.gmail.com>
On 12/2/06, Henri Sivonen <hsivonen at iki.fi> wrote:
> On Dec 2, 2006, at 18:24, Sam Ruby wrote:
> > It would not be wise for HTML5 to limit itself to the more constrained
> > character set of XML.  In particular, the form feed character is
> > pretty popular,

BTW, I copied and pasted the wrong table.  The characters I mentioned
were discouraged (and include such things as Microsoft smart quotes
mislabeled as iso-8859-1).  The actual allowed set in XML 1.0 is as
follows:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

For XML 1.1 the list is as follows:

[#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
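
The two productions above can be turned into a quick membership check.
This is only a minimal sketch: the function and constant names are made
up for illustration, and it tests single code points against the ranges
exactly as quoted.  It does not model the XML 1.1 rule that most control
characters may appear only as character references.

```python
# Code point ranges copied from the two productions quoted above.
XML10_RANGES = [
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
    (0x20, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF),
]
XML11_RANGES = [
    (0x1, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF),
]


def _in_ranges(code_point, ranges):
    """Return True if code_point falls inside any inclusive (lo, hi) range."""
    return any(lo <= code_point <= hi for lo, hi in ranges)


def is_xml10_char(ch):
    """Hypothetical helper: is this one character allowed by XML 1.0?"""
    return _in_ranges(ord(ch), XML10_RANGES)


def is_xml11_char(ch):
    """Hypothetical helper: is this one character allowed by XML 1.1?"""
    return _in_ranges(ord(ch), XML11_RANGES)


if __name__ == "__main__":
    for label, ch in [("form feed", "\x0c"), ("NUL", "\x00"), ("tab", "\t")]:
        print(label, is_xml10_char(ch), is_xml11_char(ch))
```

Under these ranges the form feed character (#xC) falls outside the XML
1.0 set but inside the XML 1.1 set, while NUL (#x0) is outside both.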

> > This is yet another case where "take HTML5, read it into a DOM, and
> > serialize it as XML, and voilà: you have valid XHTML" doesn't work.
> What I am advocating is making sure that *conforming* HTML5 documents
> can be serialized as XHTML5 without dataloss.

Then you will also need to disallow newlines in attribute values.
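
The newline point comes from XML attribute-value normalization: an XML
parser replaces a literal newline inside an attribute value with a
space, so the original line break is lost unless the serializer writes
it as a character reference.  A small sketch using Python's standard
xml.etree.ElementTree (the element and attribute names here are
arbitrary, chosen only for illustration):

```python
import xml.etree.ElementTree as ET

# A literal newline inside an attribute value, as a naive serializer
# might write it.
literal = '<p title="line one\nline two"/>'

# The same value with the newline written as a character reference.
escaped = '<p title="line one&#10;line two"/>'

literal_value = ET.fromstring(literal).get("title")
escaped_value = ET.fromstring(escaped).get("title")

# The literal newline is normalized to a space by the XML parser;
# the character reference survives as a real newline.
print(repr(literal_value))
print(repr(escaped_value))
```

So a round trip preserves the newline only if the serializer emits
&#10; rather than a raw line break.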

In any case, I understand the desire; my read is that the WG's desire
for backwards compatibility is stronger.  Limiting the character set to
the allowable XML 1.1 character set should not be a problem for
backwards compatibility purposes.

- Sam Ruby
Received on Saturday, 2 December 2006 17:47:15 UTC
