hixie: Reword the stuff about authors not using encodings to make more sense. (whatwg r4307) from poot on 2009-10-23 (public-html-diffs@w3.org from October 2009)

From: poot <cvsmail@w3.org>
Date: Sat, 24 Oct 2009 07:13:16 +0900 (JST)
To: public-html-diffs@w3.org
Message-Id: <20091023221317.2B5FF2BCE1@toro.w3.mag.keio.ac.jp>
hixie: Reword the stuff about authors not using encodings to make more
sense. (whatwg r4307)

http://dev.w3.org/cvsweb/html5/spec/Overview.html?r1=1.3442&r2=1.3443&f=h
http://html5.org/tools/web-apps-tracker?from=4306&to=4307

===================================================================
RCS file: /sources/public/html5/spec/Overview.html,v
retrieving revision 1.3442
retrieving revision 1.3443
diff -u -d -r1.3442 -r1.3443
--- Overview.html 23 Oct 2009 22:02:53 -0000 1.3442
+++ Overview.html 23 Oct 2009 22:12:59 -0000 1.3443
@@ -1728,12 +1728,11 @@
   to support do things outside that range?  -->, ignoring bytes that
   are the second and later bytes of multibyte sequences, all
   correspond to single-byte sequences that map to the same Unicode
-  characters as those bytes in ANSI_X3.4-1968 (US-ASCII). <a href="#refsRFC1345">[RFC1345]</a><p class="note">This includes such encodings as Shift_JIS and
-  variants of ISO-2022, even though it is possible in these encodings
-  for bytes like 0x70 to be part of longer sequences that are
-  unrelated to their interpretation as ASCII. It excludes such
-  encodings as UTF-7, UTF-16, HZ-GB-2312, GSM03.38, and EBCDIC
-  variants.</p><!--
+  characters as those bytes in ANSI_X3.4-1968 (US-ASCII). <a href="#refsRFC1345">[RFC1345]</a><p class="note">This includes such encodings as Shift_JIS,
+  HZ-GB-2312, and variants of ISO-2022, even though it is possible in
+  these encodings for bytes like 0x70 to be part of longer sequences
+  that are unrelated to their interpretation as ASCII. It excludes
+  such encodings as UTF-7, UTF-16, GSM03.38, and EBCDIC variants.</p><!--
    We'll have to change that if anyone comes up with a way to have a
    document that is valid as two different encodings at once, with
    different <meta charset> elements applying in each case.
@@ -10405,13 +10404,31 @@
   <code><a href="#meta">meta</a></code> element with an <code title="attr-meta-http-equiv"><a href="#attr-meta-http-equiv">http-equiv</a></code> attribute in the
   <a href="#attr-meta-http-equiv-content-type" title="attr-meta-http-equiv-content-type">Encoding declaration
   state</a>, then the character encoding used must be an
-  <a href="#ascii-compatible-character-encoding">ASCII-compatible character encoding</a>.<p>Authors should not use JIS_C6226-1983<!-- aka JIS-X-0208,
-  x-JIS0208 -->, JIS_X0212-1990<!-- aka JIS-X-0212 -->, HZ-GB-2312<!--
-  has crazy handling of ASCII "~" -->, encodings based on ISO-2022<!--
+  <a href="#ascii-compatible-character-encoding">ASCII-compatible character encoding</a>.<p>Authors are encouraged to use UTF-8. Conformance checkers may
+  advise authors against using legacy encodings.<div class="impl">
+
+  <p>Authoring tools should default to using UTF-8 for newly-created
+  documents.</p>
+
+  </div><p>Encodings in which a series of bytes in the range 0x20 to 0x7E
+  can encode characters other than the corresponding characters in the
+  range U+0020 to U+007E represent a potential security vulnerability:
+  a user agent that does not support the encoding (or does not support
+  the label used to declare the encoding, or does not use the same
+  mechanism to detect the encoding of unlabelled content as another
+  user agent) might end up interpreting technically benign plain text
+  content as HTML tags and JavaScript. In particular, this applies to
+  encodings in which the bytes corresponding to "<code title="">&lt;script&gt;</code>" in ASCII can encode a different
+  string. Authors should not use such encodings, which are known to
+  include JIS_C6226-1983<!-- aka JIS-X-0208, x-JIS0208 -->,
+  JIS_X0212-1990<!-- aka JIS-X-0212 -->, HZ-GB-2312<!-- has crazy
+  handling of ASCII "~" -->, encodings based on ISO-2022<!--
   http://krijnhoetmer.nl/irc-logs/whatwg/20090628#l-422 and
   http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-October/023797.html
-  -->, and encodings based on EBCDIC. Authors should not use UTF-32.
-  Authors must not use the CESU-8, UTF-7, BOCU-1 and SCSU encodings.
+  -->, and encodings based on EBCDIC. Furthermore, authors must not use
+  the CESU-8, UTF-7, BOCU-1 and SCSU encodings, which also fall into
+  this category, because these encodings were never intended for use
+  for Web content.
   <a href="#refsRFC1345">[RFC1345]</a><!-- for the JIS types -->
   <a href="#refsRFC1842">[RFC1842]</a><!-- HZ-GB-2312 -->
   <a href="#refsRFC1468">[RFC1468]</a><!-- ISO-2022-JP -->
@@ -10419,27 +10436,13 @@
   <a href="#refsRFC1554">[RFC1554]</a><!-- ISO-2022-JP-2 -->
   <a href="#refsRFC1922">[RFC1922]</a><!-- ISO-2022-CN and ISO-2022-CN-EXT -->
   <a href="#refsRFC1557">[RFC1557]</a><!-- ISO-2022-KR -->
-  <a href="#refsUNICODE">[UNICODE]</a>
   <a href="#refsCESU8">[CESU8]</a>
   <a href="#refsUTF7">[UTF7]</a>
   <a href="#refsBOCU1">[BOCU1]</a>
   <a href="#refsSCSU">[SCSU]</a>
   <!-- no idea what to reference for EBCDIC, so... -->
-  <p class="note">Most of these encodings are discouraged because of
-  security concerns. If a hostile user can contribute text to a site
-  using these encodings, bugs in the site's whitelisting filter or in
-  a user agent can easily lead to the filter interpreting the
-  contribution as "safe" while the user agent interprets the same
-  contribution as containing a <code><a href="#script">script</a></code> element. This would
-  enable cross-site scripting attacks. By avoiding these encodings,
-  and always providing a <a href="#character-encoding-declaration">character encoding declaration</a>,
-  an author is less likely to run into this kind of problem.<p>Authors are encouraged to use UTF-8. Conformance checkers may
-  advise authors against using legacy encodings.<div class="impl">
-
-  <p>Authoring tools should default to using UTF-8 for newly-created
-  documents.</p>
-
-  </div><p class="note">Using non-UTF-8 encodings can have unexpected
+  <p>Authors should not use UTF-32, as the HTML5 encoding detection
+  algorithms intentionally do not distinguish it from UTF-16. <a href="#refsUNICODE">[UNICODE]</a><p class="note">Using non-UTF-8 encodings can have unexpected
   results on form submission and URL encodings, which use the
   <a href="#document-s-character-encoding">document's character encoding</a> by default.<p>In XHTML, the XML declaration should be used for inline character
   encoding information, if necessary.<div class="example">
Received on Friday, 23 October 2009 22:13:46 UTC
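
Illustration (not part of the original message): the new text in the second hunk warns about encodings in which bytes in the range 0x20 to 0x7E can encode characters other than the corresponding ASCII ones. The sketch below is a minimal demonstration of that concern using UTF-7, one of the encodings the diff says authors must not use. The names `payload` and `is_printable_ascii` are hypothetical, and the use of Python's built-in "utf-7" and "ascii" codecs is an assumption made for illustration only; it does not model any particular user agent's detection logic.

```python
# A byte string made up only of printable ASCII bytes (0x20 to 0x7E).
payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"


def is_printable_ascii(data: bytes) -> bool:
    """Return True if every byte lies in the range 0x20 to 0x7E."""
    return all(0x20 <= b <= 0x7E for b in data)


# Interpreted as ASCII, the bytes are harmless-looking plain text.
as_ascii = payload.decode("ascii")

# Interpreted as UTF-7, the same bytes decode to markup.
as_utf7 = payload.decode("utf-7")

print(is_printable_ascii(payload))  # every byte is in 0x20..0x7E
print("<" in as_ascii)              # no tag delimiter in the ASCII reading
print(as_utf7)                      # the UTF-7 reading contains a script element
```

A filter that reads the bytes as ASCII sees no "<" at all, while a consumer that reads them as UTF-7 sees a script element. That mismatch between two interpretations of the same bytes is the situation the reworded paragraph describes.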