- From: Øistein E. Andersen <liszt@coq.no>
- Date: Fri, 23 Oct 2009 21:21:07 +0100
On 23 Oct 2009, at 04:20, Ian Hickson wrote:

> On Wed, 21 Oct 2009, Øistein E. Andersen wrote:
>>
>> ASCII-compatibility:
>> The note in "2.1.5 Character encodings" seems to say that [...]
>> ISO-2022[-*] are ASCII-compatible, whereas HZ-GB-2312 is not, and
>> I cannot find anything in Section 2.1.5 that would explain this
>> difference.
>
> HZ-GB-2312 uses the byte ASCII uses for "~" as the escape character.
> ISO-2022-* uses the control codes. That's the difference.

'~'/0x7E is not (and should not be, as far as I can tell) relevant for
HTML5's concept of ASCII compatibility.

>> Discouraged encodings: [...]
>>
>>> Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212
>>> (JIS_X0212-1990), [...]
>>
>> It is not clear what this means [...]
>
> This is talking about character encodings, not character sets.
> "JIS_C6226-1983" is a registered character encoding in the IANA
> registry.

(This is less confusing now since HTML5 only deals with character
encodings and the strings match those in the IANA registry as
suggested by Yui Naruse.)

>> the list of discouraged encodings seems conspicuously short if it is
>> supposed to be complete; and the lack of rationale makes it
>> difficult to understand why these encodings are considered
>> particularly harmful (JIS_C6226-1983 v. JIS_C6226-1978 or ISO-2022
>> v. HZ, to mention but two at least initially puzzling cases).
>
> The reason for including these is to discourage encodings known to
> have security issues. I've added HZ-GB-2312, which can be used in a
> similarly dangerous fashion. (Basically the danger for user agents is
> in an attacker using an encoding that a user agent could autodetect,
> while a site interprets the bytes safely; that would allow those
> encodings to be used to smuggle <script> elements in a way that a
> naive whitelisting filter would think is safe.)
>
>> It might be better to say *why* particular encodings are better
>> avoided, whether or not the list of discouraged encodings be
>> presented as definitive.
>
> I've added a note.
>
> [...]
>
> On Thu, 22 Oct 2009, Philip Taylor wrote:
>>
>> The string "[????]" encoded as ISO-2022-KR is the bytes 0e 3c 73
>> 63 72 69 70 74 3e. A UA that doesn't support ISO-2022-KR (e.g.
>> Chrome, when I last checked) will decode it as Windows-1252 and get
>> the string "<script>", which is bad. So a site that uses ISO-2022-KR
>> is very likely to expose some users to XSS attacks, which seems like
>> a good reason to discourage that encoding. The same applies to other
>> ISO-2022 encodings.
>
> [...]
>
> On Thu, 22 Oct 2009, Øistein E. Andersen wrote:
>>
>> If that is the reason, at least HZ encoding would seem to be
>> affected as well. Explicitly discouraging a more or less random
>> subset of the problematic encodings without providing rationale
>> makes it difficult to assess whether or not other, somewhat similar,
>> encodings should be avoided as well, which was the main issue I
>> wanted to raise.
>
> Hopefully this is somewhat addressed now.

The added note certainly helps, but it is vague (does "[m]ost of these
encodings" mean "all the encodings mentioned above apart from UTF-32"?)
and inaccurate (Philip Taylor's example does not rely on "bugs"). Given
that the set of encodings is open-ended, I still think it would be
preferable to make the rationale (a definition of what makes an
encoding problematic) primary and mention actual encodings as examples.
This could give something like the following:

"Encodings in which a series of bytes in the range 0x20..0x7E may
encode characters other than the corresponding characters in the range
U+20..U+7E represent a potential security vulnerability since a browser
that does not support the encoding (or does not support the label used
to declare the encoding, or does not use the same mechanism to detect
the encoding of unlabelled content) might end up interpreting
technically benign plain text content as HTML tags and JavaScript. In
particular, this applies to encodings in which the bytes corresponding
to '<script>' in ASCII may encode a different string. Authors should
not use such encodings, which are known to include.... In addition,
authors should not use UTF-32 ...."

Alternatively, fixing the current note would help and might be
sufficient, albeit not ideal. I think one has to realise that a
comprehensive list of problematic encodings is an elusive goal and act
accordingly.

--
Øistein E. Andersen

PS: The following sentence makes little sense without (curly) quotes
and apostrophes. In case they disappeared before you read it, please
find it repeated below with (ASCII) quotes and apostrophes:

>> It should probably be "advise against authors' using legacy
>> encodings" or better "advise authors against using legacy
>> encodings".

(The current text in the spec is fine.)
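
For illustration, the two cases discussed above can be reproduced with
a minimal Python sketch. The function names are hypothetical, the use
of Windows-1252 (Python's "cp1252") as the fallback follows Philip
Taylor's description rather than any particular browser, and Python's
"hz" codec is assumed to stand in for HZ-GB-2312. Note that the leading
0x0E (Shift Out) byte survives the fallback decoding as a control
character in front of "<script>".

```python
# Philip Taylor's example bytes, as quoted in the message above.
PAYLOAD = bytes.fromhex("0e 3c 73 63 72 69 70 74 3e")


def decode_as_fallback(data: bytes, fallback: str = "cp1252") -> str:
    """Decode as a UA lacking the declared encoding might (assumed fallback)."""
    return data.decode(fallback)


def reinterprets_printable_ascii(data: bytes, encoding: str) -> bool:
    """Return True if `data` (bytes 0x20..0x7E only) decodes under
    `encoding` to anything other than the same ASCII characters."""
    if not all(0x20 <= b <= 0x7E for b in data):
        raise ValueError("expected only bytes in the range 0x20..0x7E")
    return data.decode(encoding) != data.decode("ascii")


if __name__ == "__main__":
    # The fallback decoding exposes a script tag to the HTML parser.
    print(repr(decode_as_fallback(PAYLOAD)))
    # In HZ, "~{" switches to two-byte mode, so the bytes of "<script>"
    # are read as four GB 2312 characters instead of markup.
    print(reinterprets_printable_ascii(b"~{<script>~}", "hz"))
    # UTF-8 never reinterprets printable ASCII bytes.
    print(reinterprets_printable_ascii(b"<script>", "utf-8"))
```

A check of this kind only tests one byte string against one encoding;
it illustrates the proposed definition but is not a way to enumerate
every problematic encoding.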
Received on Friday, 23 October 2009 13:21:07 UTC