Re: C0 control characters in HTML 5 from Mike Brown on 2007-06-22 (public-html@w3.org from June 2007)

From: Mike Brown <mike@skew.org>
Date: Fri, 22 Jun 2007 01:23:46 -0600 (MDT)
To: Ian Hickson <ian@hixie.ch>
CC: Mike Brown <mike@skew.org>, public-html@w3.org
Message-Id: <200706220723.l5M7NkAZ035200@chilled.skew.org>

Ian Hickson wrote:
> On Mon, 18 Jun 2007, Mike Brown wrote:
> > 
> > HTML 5 seems to now allow the entire U+0001..U+001F range, whereas HTML 
> > 4.x, 3.2, and I think 2.0, as defined by their "document character set" 
> > and SGML profile, have long forbidden all of that range except for tab, 
> > LF, CR, and, inexplicably, FF.
> > 
> > Why is HTML 5 different, and what are the expectations for the 
> > processing of the now-allowed BEL, BS, VT, DEL, and so on? If it was 
> > deliberate, why not put a note of explanation in the spec?
> 
> It was deliberate only insofar as I didn't come across any reason to 
> disallow them. The expectations for their processing are unaffected by 
> whether they are allowed or not.
> 
> What would the note explain?
> 

The note would explain why you feel it's important to include those codes in 
HTML 5, and the fact that there are no expectations of how they're 
interpreted; they're just no longer disallowed. Perhaps I'm just spoiled by 
the HTML 4 spec which mentions things like that.

I'm guessing those control codes were previously disallowed out of concern, at 
the time, for console-based browsers: you don't want such a browser to blindly 
pass control codes to the user's terminal. Arguably, that'd be the browser's 
mistake if it did, but why let the language permit it? It also perhaps makes 
more sense to just disallow such codes; they shouldn't be applicable in a 
modern document language that operates on a descriptive level of abstraction, 
rather than on a level that implies direct control of a terminal.
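
To make the console-browser concern concrete, here is a rough sketch in Python 
of the kind of filtering such a browser might apply before writing text to a 
terminal. The function name, the decision to also catch DEL (U+007F), and the 
use of U+FFFD as the replacement are all assumptions for illustration; nothing 
here comes from a spec. The allowed set is the HTML 4 one: tab, LF, CR, and FF.

```python
# Hypothetical helper; the name and replacement policy are illustrative only.
# C0 controls permitted by HTML 4's document character set: tab, LF, FF, CR.
HTML4_ALLOWED_C0 = {0x09, 0x0A, 0x0C, 0x0D}


def sanitize_for_terminal(text, replacement="\uFFFD"):
    """Replace C0 controls outside the HTML 4 allowed set, plus DEL."""
    out = []
    for ch in text:
        cp = ord(ch)
        is_disallowed_c0 = cp <= 0x1F and cp not in HTML4_ALLOWED_C0
        if is_disallowed_c0 or cp == 0x7F:
            # Never pass BEL, BS, ESC, etc. through to the terminal.
            out.append(replacement)
        else:
            out.append(ch)
    return "".join(out)
```

With this, an ESC-based sequence such as "\x1b[2J" loses its ESC and arrives 
at the terminal as inert text, while tabs and line breaks pass through.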

I imagine it may also have been an effort to further deprecate the codes, to 
keep them from finding new life after all the technologies that most of them 
had been invented for were relegated to the recycling heap. It's my 
understanding that they were only included in the UCS, and that the UCS is 
organized the way it is, to placate people who were concerned over 
compatibility. Why prolong the life of these things that should die?

There are some who feel that such deprecation of codes really makes their 
lives difficult, though, so XML 1.1's compromise was to go ahead and allow 
them, but discourage their use. See the note in section 2.2 of XML 1.1.

If you do allow all of U+0001..U+001F, then you might as well allow the 
U+0080..U+009F range too, no?

Do you have any plans to acknowledge the Windows-1252 confusion for NCRs in 
that range, such as &#128; being treated as the Euro sign by many (most?) 
browsers?
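
For illustration, the remapping I mean can be sketched in Python as below. 
The function name is made up, and the fallback for bytes that Python's cp1252 
codec leaves unassigned (0x81, for example) is my assumption; actual browsers 
may handle those differently.

```python
def resolve_ncr(code_point):
    """Resolve a numeric character reference, treating 128..159 as Windows-1252.

    Hypothetical sketch of the behavior described above, not a spec algorithm.
    """
    if 0x80 <= code_point <= 0x9F:
        try:
            # Reinterpret the number as a Windows-1252 byte, e.g. 0x80 -> U+20AC.
            return bytes([code_point]).decode("cp1252")
        except UnicodeDecodeError:
            # Unassigned in Python's cp1252 codec; keep the C1 control as-is
            # (an assumption made for this sketch).
            return chr(code_point)
    return chr(code_point)
```

So resolve_ncr(128) yields the Euro sign (U+20AC) rather than the C1 control 
U+0080, which is the confusion in question.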

Received on Friday, 22 June 2007 07:24:24 UTC