[whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

From: Ian Hickson <ian@hixie.ch>
Date: Fri, 23 Oct 2009 22:25:54 +0000 (UTC)
Message-ID: <Pine.LNX.4.62.0910232224220.9145@hixie.dreamhostps.com>
On Fri, 23 Oct 2009, Øistein E. Andersen wrote:
> On 23 Oct 2009, at 04:20, Ian Hickson wrote:
> > On Wed, 21 Oct 2009, Øistein E. Andersen wrote:
> > >
> > > ASCII-compatibility:
> > > The note in "2.1.5 Character encodings" seems to say that [...]
> > > ISO-2022[-*] are ASCII-compatible, whereas HZ-GB-2312 is not, and I
> > > cannot find anything in Section 2.1.5 that would explain this difference.
> > 
> > HZ-GB-2312 uses the byte ASCII uses for "~" as the escape character.
> > ISO-2022-* uses the control codes. That's the difference.
> 
> '~'/0x7E is not (and should not be, as far as I can tell) relevant for HTML5's
> concept of ASCII compatibility.

Good point. Moved the encoding over to the other side.
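
For illustration only, the escape mechanisms contrasted above can be observed
with Python's built-in codecs. This is a minimal sketch, not part of the spec
text: the sample string and the helper name is_printable_ascii are assumptions
made for the example, and "hz" and "iso2022_jp" are Python's codec names for
HZ-GB-2312 and ISO-2022-JP.

```python
# Illustrative sketch; the sample string is arbitrary.
text = "\u65e5\u672c"  # two CJK characters present in both GB2312 and JIS X 0208

hz_bytes = text.encode("hz")           # HZ-GB-2312
iso_bytes = text.encode("iso2022_jp")  # ISO-2022-JP


def is_printable_ascii(data: bytes) -> bool:
    """True if every byte lies in the printable ASCII range 0x20..0x7E."""
    return all(0x20 <= b <= 0x7E for b in data)


# HZ switches modes with "~{" and "~}", which are ordinary printable bytes.
hz_uses_tilde_escape = hz_bytes.startswith(b"~{")
# ISO-2022-JP switches modes with sequences introduced by ESC (0x1B).
iso_uses_esc_control = iso_bytes.startswith(b"\x1b")

print(hz_bytes, is_printable_ascii(hz_bytes))
print(iso_bytes, is_printable_ascii(iso_bytes))
```

The HZ output consists entirely of bytes in 0x20..0x7E, while the ISO-2022-JP
output contains a control byte outside that range.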


> The added note certainly helps, but it is vague (does "[m]ost of these 
> encodings" mean "all the encodings mentioned above apart from UTF-32"?) 
> and inaccurate (Philip Taylor's example does not rely on "bugs").
> 
> Given that the set of encodings is open-ended, I still think it would be 
> preferable to make the rationale (a definition of what makes an encoding 
> problematic) primary and mention actual encodings as examples. This 
> could give something like the following: "Encodings in which a series of 
> bytes in the range 0x20..0x7E may encode characters other than the 
> corresponding characters in the range U+20..U+7E represent a potential 
> security vulnerability since a browser that does not support the 
> encoding (or does not support the label used to declare the encoding, or 
> does not use the same mechanism to detect the encoding of unlabelled 
> content) might end up interpreting technically benign plain text content 
> as HTML tags and JavaScript.  In particular, this applies to encodings 
> in which the bytes corresponding to '<script>' in ASCII may encode a 
> different string. Authors should not use such encodings, which are known 
> to include....  In addition, authors should not use UTF-32 ...." 
> Alternatively, fixing the current note would help and might be 
> sufficient, albeit not ideal.

I've reworded the spec based on your suggestion. Thanks!
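
As a rough illustration of the rationale quoted above, the following sketch
uses Python's "hz" codec to show a byte sequence that stays within 0x20..0x7E
yet encodes something other than the corresponding ASCII characters. The
payload and the helper name decodes_like_ascii are assumptions chosen for the
example, not wording from the spec.

```python
# Every byte of this payload lies in the range 0x20..0x7E.
payload = b"~{<script>~}"


def decodes_like_ascii(data: bytes, encoding: str) -> bool:
    """True if `data` decodes to the same text under `encoding` as under ASCII."""
    return data.decode(encoding) == data.decode("ascii")


# What a consumer without HZ support would see: the literal tag is present.
as_ascii = payload.decode("ascii")

# What an HZ-aware decoder sees: the eight bytes between "~{" and "~}"
# are read as four two-byte GB2312 characters, with no "<" anywhere.
as_hz = payload.decode("hz")

print(repr(as_ascii))
print(repr(as_hz))
print(decodes_like_ascii(payload, "latin-1"))
print(decodes_like_ascii(payload, "hz"))
```

Under ISO-8859-1 the payload reads the same as under ASCII, so the mismatch is
specific to encodings such as HZ that reuse the printable ASCII byte range.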

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Friday, 23 October 2009 15:25:54 UTC