Re: [whatwg] Null characters

From: Ian Hickson <ian@hixie.ch>
Date: Tue, 9 Oct 2012 18:47:21 +0000 (UTC)
To: Cameron Zemek <grom@zeminvaders.net>, Anne van Kesteren <annevk@annevk.nl>, Boris Zbarsky <bzbarsky@MIT.EDU>
Message-ID: <Pine.LNX.4.64.1210091845230.1904@ps20323.dreamhostps.com>
Cc: whatwg@whatwg.org

On Tue, 9 Oct 2012, Cameron Zemek wrote:
> On Tue, Oct 9, 2012 at 1:36 PM, Ian Hickson <ian@hixie.ch> wrote:
> > On Tue, 9 Oct 2012, Cameron Zemek wrote:
> >>
> >> I noticed the specification usually handles null characters (U+0000) by 
> >> replacing them with the replacement character (U+FFFD). In the other 
> >> cases they are ignored by the tree construction stage, when the insertion 
> >> mode is 'in body', 'in table text', or 'in select'.
> >>
> >> Would it not be simpler and more consistent to just have the Input 
> >> Stream Preprocessor replace all null characters with the replacement 
> >> character?
> >
> > Yes. In fact that's what the spec used to do.
> >
> > Turns out it's not Web-compatible. :-(
> 
> How is it not web-compatible? PS: Maybe a note should be added to the 
> specification that explains this.

I could add a note... based on what Boris described, what would you want 
the note to say and where would you want it placed, such that you would 
have seen it when your original reading caused you to e-mail the list?

(This part of the spec is rather large, and the NULL handling happens all 
over the place, so I don't know where would be best.)


On Tue, 9 Oct 2012, Boris Zbarsky wrote:
> > 
> > But just thinking about it logically, what issues would there be in 
> > showing null characters as the replacement character instead? Readers 
> > would see some extra characters if the document author had null 
> > characters. What is the big deal with doing that?
> 
> It makes text unreadable.  Consider text that's actually UTF-16 but 
> being declared as ISO-8859-1.  If you strip the nulls, it all works out.  
> But if you don't, every other character is a replacement character.
> 
> This is not a rare situation on the web, unfortunately.
> 
> > Why do authors even have null characters in their HTML documents?
> 
> Because they have UTF-16 text in their database that they dump into an 
> ISO-8859-1 document.  They have no idea there are any "null characters" 
> involved.
> 
> > I assume I'm probably missing some historical reason for this
> 
> Yes, that reason is "the browsers all do it this way, so web sites 
> depend on it".

Yup. :-(
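
For illustration, here is a minimal Python sketch of the situation Boris 
describes: UTF-16 text whose bytes get decoded as ISO-8859-1. It assumes 
little-endian UTF-16 without a BOM and ASCII-range text, and the helper 
names (decode_mislabelled, strip_nulls, replace_nulls) are hypothetical. 
It is not how any browser or the spec's parser is implemented.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def decode_mislabelled(text: str) -> str:
    """Encode text as UTF-16LE (no BOM), then decode those bytes as ISO-8859-1.

    For ASCII-range input, every second character of the result is U+0000.
    """
    return text.encode("utf-16-le").decode("iso-8859-1")


def strip_nulls(s: str) -> str:
    """Drop every U+0000, roughly what ignoring nulls amounts to."""
    return s.replace("\x00", "")


def replace_nulls(s: str) -> str:
    """Map every U+0000 to U+FFFD, as the simpler proposal would."""
    return s.replace("\x00", REPLACEMENT)


raw = decode_mislabelled("Hello")
print(repr(raw))            # nulls interleaved with the letters
print(strip_nulls(raw))     # readable again
print(replace_nulls(raw))   # every other character is U+FFFD
```

With the nulls stripped the text reads normally; with them replaced, a 
replacement character follows each letter, which is the unreadable result 
described above.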

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Tuesday, 9 October 2012 18:47:52 GMT
