Re: C0 control characters in HTML 5

On Fri, 22 Jun 2007, Mike Brown wrote:
> > > 
> > > HTML 5 seems to now allow the entire U+0001..U+001F range, whereas 
> > > HTML 4.x, 3.2, and I think 2.0, as defined by their "document 
> > > character set" and SGML profile, have long forbidden all of that 
> > > range except for tab, LF, CR, and, inexplicably, FF.
> > > 
> > > Why is HTML 5 different, and what are the expectations for the 
> > > processing of the now-allowed BEL, BS, VT, DEL, and so on? If it was 
> > > deliberate, why not put a note of explanation in the spec?
> > 
> > It was deliberate only insofar as I didn't come across any reason to 
> > disallow them. The expectations for their processing are unaffected by 
> > whether they are allowed or not.
> > 
> > What would the note explain?
> 
> The note would explain why you feel it's important to include those 
> codes in HTML 5

I don't feel it's important either way. I don't really have an opinion.


> and the fact that there are no expectations of how they're interpreted; 

Well, there are expectations, they're the same expectations as for any 
other character.


> they're just no longer disallowed. Perhaps I'm just spoiled by the HTML 
> 4 spec which mentions things like that.

Generally the spec doesn't have notes for changes from previous versions; 
there are just so many of them. However, we should indeed note it; Anne, 
would this be something for the "changes since HTML4" doc?


> I'm guessing those control codes were previously disallowed out of a 
> fear that there may have been some concern, at the time, for 
> console-based browsers: you don't want such a browser to blindly pass 
> control codes to the user's terminal.

Right, but disallowing them doesn't affect this at all. I mean, even if 
they're disallowed, people might still include them. So you still have to 
handle them whether they're allowed or not (that's what I meant when I 
said that "the expectations for their processing are unaffected by whether 
they are allowed or not").
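
To illustrate, here is a minimal sketch of what a console-based browser 
might do before writing text to the terminal. Nothing here comes from the 
spec: the function name, the set of characters kept, and the use of U+FFFD 
as the replacement are all assumptions made for the example. The point is 
only that this filtering is needed regardless of what the conformance 
rules say.

```python
def strip_controls(text: str, replacement: str = "\ufffd") -> str:
    """Replace C0/C1 control characters and DEL, keeping tab, LF, and CR.

    Hypothetical helper: the kept set and the replacement character are
    illustrative choices, not requirements from any specification.
    """
    keep = {"\t", "\n", "\r"}
    out = []
    for ch in text:
        cp = ord(ch)
        # C0 controls are U+0000..U+001F; DEL and the C1 controls are
        # U+007F..U+009F.
        is_control = cp <= 0x1F or 0x7F <= cp <= 0x9F
        out.append(replacement if is_control and ch not in keep else ch)
    return "".join(out)
```

A caller would run every text node through strip_controls() before output, 
so a stray BEL or ESC in the document never reaches the user's terminal.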


> [...] why let the language permit it?

Why disallow it? I don't know, I don't really have a good reason one way 
or the other.


> If you do allow all of U+0001..U+001F then you might as well allow 
> U+0080..U+009F range as well, no?

Sure, they're allowed too.


> Do you have any plans to acknowledge the Windows-1252 confusion for NCRs 
> in that range, such as &#128; being treated as Euro by many (most?) 
> browsers?

That's already covered in the spec.
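
As a rough sketch of the behaviour being described (this is not the spec's 
own table or algorithm; the function name and the fallback for bytes that 
Windows-1252 leaves undefined are assumptions for the example), a numeric 
character reference in the 128..159 range can be reinterpreted as a 
Windows-1252 byte:

```python
def resolve_ncr(code_point: int) -> str:
    """Map a numeric character reference to a character.

    Hypothetical helper: references in 0x80..0x9F are read as
    Windows-1252 bytes. Bytes that Python's cp1252 codec leaves
    undefined fall back to the literal code point, which is an
    assumption of this sketch.
    """
    if 0x80 <= code_point <= 0x9F:
        try:
            return bytes([code_point]).decode("cp1252")
        except UnicodeDecodeError:
            return chr(code_point)
    return chr(code_point)
```

With this, resolve_ncr(128) gives the Euro sign (U+20AC) rather than the 
C1 control character U+0080.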

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Friday, 22 June 2007 07:59:26 UTC