- From: Ian Hickson <ian@hixie.ch>
- Date: Fri, 23 Oct 2009 03:20:01 +0000 (UTC)
On Wed, 21 Oct 2009, Øistein E. Andersen wrote:
>
> ASCII-compatibility:
> The note in "2.1.5 Character encodings" seems to say that "variants of
> ISO-2022" (presumably including common ones like ISO-2022-CN, ISO-2022-KR and
> ISO-2022-JP) are ASCII-compatible, whereas HZ-GB-2312 is not, and I cannot
> find anything in Section 2.1.5 that would explain this difference.

HZ-GB-2312 uses the byte ASCII uses for "~" as the escape character.
ISO-2022-* uses the control codes. That's the difference.

> Discouraged encodings:
> "4.2.5.5 Specifying the document's character encoding" advises against
> certain encodings. In particular:
>
> > Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212
> > (JIS_X0212-1990), encodings based on ISO-2022, and encodings based on
> > EBCDIC.
>
> It is not clear what this means (e.g., the character set JIS_C6226-1983 in
> any encoding, or only when encoded alone according to RFC1345 as described
> above);

This is talking about character encodings, not character sets.
"JIS_C6226-1983" is a registered character encoding in the IANA registry.

> the list of discouraged encodings seems conspicuously short if it is
> supposed to be complete; and the lack of rationale makes it difficult to
> understand why these encodings are considered particularly harmful
> (JIS_C6226-1983 v. JIS_C6226-1978 or ISO-2022 v. HZ, to mention but two
> at least initially puzzling cases).

The reason for including these is to discourage encodings known to have
security issues. I've added HZ-GB-2312, which can be used in a similarly
dangerous fashion.

(Basically the danger for user agents is in an attacker using an encoding
that a user agent could autodetect, while a site interprets the bytes
safely; that would allow those encodings to be used to smuggle <script>
elements in a way that a naive whitelisting filter would think is safe.)
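
To make the parenthetical above concrete, here is a minimal Python sketch of the mismatch, not a description of any real filter or browser. The function names are hypothetical. The ISO-2022-KR payload is the byte sequence Philip Taylor gives in this thread, with the ESC $ ) C designator and a trailing shift-in added as an assumption so that a conforming decoder accepts it. The HZ payload is an assumed analogue built the same way. Python's cp1252 codec stands in for a user agent that does not support the declared encoding and falls back to Windows-1252.

```python
# Hypothetical server-side filter: decode the upload with the encoding the
# site believes it is in, and reject it if any "<" character appears.
def naive_filter_allows(data: bytes, declared_encoding: str) -> bool:
    text = data.decode(declared_encoding)
    return "<" not in text


# Hypothetical user agent that does not support the declared encoding and
# falls back to Windows-1252 (the scenario described in this thread).
def fallback_view(data: bytes) -> str:
    return data.decode("cp1252")


# Bytes 0e 3c 73 63 72 69 70 74 3e from this thread. The designator
# (ESC $ ) C) and the trailing SI (0f) are assumptions added for the sketch.
ISO_2022_KR_PAYLOAD = b"\x1b$)C" + bytes.fromhex("0e3c7363726970743e") + b"\x0f"

# Assumed HZ analogue: "~{" switches to two-byte mode, "~}" switches back.
HZ_PAYLOAD = b"~{<script>~}"

CASES = (
    ("iso2022_kr", ISO_2022_KR_PAYLOAD),
    ("hz", HZ_PAYLOAD),
)

if __name__ == "__main__":
    for encoding, payload in CASES:
        print(encoding)
        print("  filter allows it:", naive_filter_allows(payload, encoding))
        print("  fallback decoding:", repr(fallback_view(payload)))
```

In both cases the filter sees only two-byte characters and lets the input through, while the fallback decoding contains a literal <script> tag.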
> It might be better to say *why* particular encodings are better avoided,
> whether or not the list of discouraged encodings be presented as
> definitive.

I've added a note.

> (Incidentally, this advice probably deserves not to be "hidden" in a
> section nominally reserved for character encoding *declaration* issues.)

Yeah. I considered moving it to the Writing HTML documents section, but
that one doesn't apply to conformance checkers, so it ends up being more
of a pain, since the advice would have to be split into multiple pieces so
that it applied appropriately. It's not a big deal.

> Minor grammar detail in 4.2.5.5:
> > Conformance checkers may advise against authors using legacy encodings.
>
> This is ambiguous. It should probably be "advise against authors' using
> legacy encodings" or better "advise authors against using legacy
> encodings".

Fixed.

On Fri, 23 Oct 2009, NARUSE, Yui wrote:
> >>>
> >>> Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212
> >>> (JIS_X0212-1990), encodings based on ISO-2022, and encodings based
> >>> on EBCDIC.
>
> First, JIS-X-0208 and JIS-X-0212 are not in IANA Charsets, moreover
> those correct names as spec are JIS X 0208 and JIS X 0212.

On Thu, 22 Oct 2009, Øistein E. Andersen wrote:
>
> I am not sure what you mean; they are both listed at
> <http://www.iana.org/assignments/character-sets>:
>
> Name: JIS_C6226-1983 [RFC1345,KXS2]
> MIBenum: 63
> Source: ECMA registry
> Alias: iso-ir-87
> Alias: x0208
> Alias: JIS_X0208-1983
> Alias: csISO87JISX0208
>
> Name: JIS_X0212-1990 [RFC1345,KXS2]
> MIBenum: 98
> Source: ECMA registry
> Alias: x0212
> Alias: iso-ir-159
> Alias: csISO159JISX02121990

On Fri, 23 Oct 2009, NARUSE, Yui wrote:
>
> Where is the word "JIS-X-0208" ?
> Where is the word "JIS-X-0212" ?

The exact string isn't there, that's why I included the preferred MIME
names in brackets in the spec.

On Fri, 23 Oct 2009, NARUSE, Yui wrote:
>
> Second, JIS_C6226-1983, JIS_X0212-1990, and EBCDICs are not
> ASCII compatible. So they are out of discouraged; mustn't use.

You can use non-ASCII-compatible encodings (e.g. UTF-16).

> Finally, Why ISO 2022 series is discouraged is not clear.

Hopefully this is clear now.

> Anyway, most of charsets defined RFC 1345 are not clear.
> Conversion table between Unicode is needed.

On Thu, 22 Oct 2009, Øistein E. Andersen wrote:
>
> > moreover those correct names as spec are JIS X 0208 and JIS X 0212.
>
> (The IANA registry is internally inconsistent and often disagrees with
> official standards when it comes to capitalisation, dashes/hyphens,
> underscores and spaces, so it is difficult to get this right. Please
> excuse me for not always paying due attention to such details in
> e-mails. Of course, the specifications should follow either IANA or the
> official standard as appropriate, depending on what it is referring to.)
>
> > Second, JIS_C6226-1983, JIS_X0212-1990, and EBCDICs are not ASCII
> > compatible. So they are out of discouraged; mustn't use.
>
> EBCDIC is clearly not ASCII-compatible and may be unique amongst the
> character sets in the IANA registry in providing the full ASCII
> repertoire in a different arrangement.
>
> JIS_C6226-1983 and JIS_X0212-1990 as defined in RFC1345 (i.e., on their
> own) do not contain basic ASCII characters at all, so it makes little
> sense to use them for HTML documents without adding ASCII or the
> ASCII-based JIS C 6220-1969, which would give something like EUC-JP or
> ISO-2022-JP. JIS_C6226-1983 contains wide versions of ASCII characters,
> but those are not interpreted as HTML mark-up (unless I am mistaken).
> JIS_X0212-1990 does not contain ASCII, kana or basic kanji, so it is of
> extremely limited usefulness on its own even in a plain-text setting.
> Warning against completely useless encodings seems pointless.
>
> Many other encodings in the IANA registry are ASCII-incompatible in
> different ways; what I do not understand is what makes the ones
> currently mentioned in the HTML5 draft particularly harmful.
>
> > Finally, Why ISO 2022 series is discouraged is not clear.
>
> We agree on this point.
>
> > Anyway, most of charsets defined RFC 1345 are not clear. Conversion
> > table between [those charsets and] Unicode is needed.
>
> Quite. Anne van Kesteren, I and several others are currently trying to
> document how browsers handle different encodings at
> <http://wiki.whatwg.org/wiki/Web_Encodings>, and defining mappings to
> Unicode is one of the goals. Your contribution would be much
> appreciated.

Good luck with that. It's much-needed work.

On Thu, 22 Oct 2009, Philip Taylor wrote:
>
> The string "????????????" encoded as ISO-2022-KR is the bytes 0e 3c 73
> 63 72 69 70 74 3e. A UA that doesn't support ISO-2022-KR (e.g. Chrome,
> when I last checked) will decode it as Windows-1252 and get the string
> "<script>", which is bad. So a site that uses ISO-2022-KR is very likely
> to expose some users to XSS attacks, which seems like a good reason to
> discourage that encoding. The same applies to other ISO-2022 encodings.

Indeed.

On Thu, 22 Oct 2009, Øistein E. Andersen wrote:
>
> If that is the reason, at least HZ encoding would seem to be affected as
> well. Explicitly discouraging a more or less random subset of the
> problematic encodings without providing rationale makes it difficult to
> assess whether or not other, somewhat similar, encodings should be
> avoided as well, which was the main issue I wanted to raise.

Hopefully this is somewhat addressed now.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Thursday, 22 October 2009 20:20:01 UTC