- From: poot <cvsmail@w3.org>
- Date: Sat, 24 Oct 2009 07:13:16 +0900 (JST)
- To: public-html-diffs@w3.org
hixie: Reword the stuff about authors not using encodings to make more
sense. (whatwg r4307)
http://dev.w3.org/cvsweb/html5/spec/Overview.html?r1=1.3442&r2=1.3443&f=h
http://html5.org/tools/web-apps-tracker?from=4306&to=4307
===================================================================
RCS file: /sources/public/html5/spec/Overview.html,v
retrieving revision 1.3442
retrieving revision 1.3443
diff -u -d -r1.3442 -r1.3443
--- Overview.html 23 Oct 2009 22:02:53 -0000 1.3442
+++ Overview.html 23 Oct 2009 22:12:59 -0000 1.3443
@@ -1728,12 +1728,11 @@
to support do things outside that range? -->, ignoring bytes that
are the second and later bytes of multibyte sequences, all
correspond to single-byte sequences that map to the same Unicode
- characters as those bytes in ANSI_X3.4-1968 (US-ASCII). <a href="#refsRFC1345">[RFC1345]</a><p class="note">This includes such encodings as Shift_JIS and
- variants of ISO-2022, even though it is possible in these encodings
- for bytes like 0x70 to be part of longer sequences that are
- unrelated to their interpretation as ASCII. It excludes such
- encodings as UTF-7, UTF-16, HZ-GB-2312, GSM03.38, and EBCDIC
- variants.</p><!--
+ characters as those bytes in ANSI_X3.4-1968 (US-ASCII). <a href="#refsRFC1345">[RFC1345]</a><p class="note">This includes such encodings as Shift_JIS,
+ HZ-GB-2312, and variants of ISO-2022, even though it is possible in
+ these encodings for bytes like 0x70 to be part of longer sequences
+ that are unrelated to their interpretation as ASCII. It excludes
+ such encodings as UTF-7, UTF-16, GSM03.38, and EBCDIC variants.</p><!--
We'll have to change that if anyone comes up with a way to have a
document that is valid as two different encodings at once, with
different <meta charset> elements applying in each case.
@@ -10405,13 +10404,31 @@
<code><a href="#meta">meta</a></code> element with an <code title="attr-meta-http-equiv"><a href="#attr-meta-http-equiv">http-equiv</a></code> attribute in the
<a href="#attr-meta-http-equiv-content-type" title="attr-meta-http-equiv-content-type">Encoding declaration
state</a>, then the character encoding used must be an
- <a href="#ascii-compatible-character-encoding">ASCII-compatible character encoding</a>.<p>Authors should not use JIS_C6226-1983<!-- aka JIS-X-0208,
- x-JIS0208 -->, JIS_X0212-1990<!-- aka JIS-X-0212 -->, HZ-GB-2312<!--
- has crazy handling of ASCII "~" -->, encodings based on ISO-2022<!--
+ <a href="#ascii-compatible-character-encoding">ASCII-compatible character encoding</a>.<p>Authors are encouraged to use UTF-8. Conformance checkers may
+ advise authors against using legacy encodings.<div class="impl">
+
+ <p>Authoring tools should default to using UTF-8 for newly-created
+ documents.</p>
+
+ </div><p>Encodings in which a series of bytes in the range 0x20 to 0x7E
+ can encode characters other than the corresponding characters in the
+ range U+0020 to U+007E represent a potential security vulnerability:
+ a user agent that does not support the encoding (or does not support
+ the label used to declare the encoding, or does not use the same
+ mechanism to detect the encoding of unlabelled content as another
+ user agent) might end up interpreting technically benign plain text
+ content as HTML tags and JavaScript. In particular, this applies to
+ encodings in which the bytes corresponding to "<code title=""><script></code>" in ASCII can encode a different
+ string. Authors should not use such encodings, which are known to
+ include JIS_C6226-1983<!-- aka JIS-X-0208, x-JIS0208 -->,
+ JIS_X0212-1990<!-- aka JIS-X-0212 -->, HZ-GB-2312<!-- has crazy
+ handling of ASCII "~" -->, encodings based on ISO-2022<!--
http://krijnhoetmer.nl/irc-logs/whatwg/20090628#l-422 and
http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2009-October/023797.html
- -->, and encodings based on EBCDIC. Authors should not use UTF-32.
- Authors must not use the CESU-8, UTF-7, BOCU-1 and SCSU encodings.
+ -->, and encodings based on EBCDIC. Furthermore, authors must not use
+ the CESU-8, UTF-7, BOCU-1 and SCSU encodings, which also fall into
+ this category, because these encodings were never intended for use
+ for Web content.
<a href="#refsRFC1345">[RFC1345]</a><!-- for the JIS types -->
<a href="#refsRFC1842">[RFC1842]</a><!-- HZ-GB-2312 -->
<a href="#refsRFC1468">[RFC1468]</a><!-- ISO-2022-JP -->
@@ -10419,27 +10436,13 @@
<a href="#refsRFC1554">[RFC1554]</a><!-- ISO-2022-JP-2 -->
<a href="#refsRFC1922">[RFC1922]</a><!-- ISO-2022-CN and ISO-2022-CN-EXT -->
<a href="#refsRFC1557">[RFC1557]</a><!-- ISO-2022-KR -->
- <a href="#refsUNICODE">[UNICODE]</a>
<a href="#refsCESU8">[CESU8]</a>
<a href="#refsUTF7">[UTF7]</a>
<a href="#refsBOCU1">[BOCU1]</a>
<a href="#refsSCSU">[SCSU]</a>
<!-- no idea what to reference for EBCDIC, so... -->
- <p class="note">Most of these encodings are discouraged because of
- security concerns. If a hostile user can contribute text to a site
- using these encodings, bugs in the site's whitelisting filter or in
- a user agent can easily lead to the filter interpreting the
- contribution as "safe" while the user agent interprets the same
- contribution as containing a <code><a href="#script">script</a></code> element. This would
- enable cross-site scripting attacks. By avoiding these encodings,
- and always providing a <a href="#character-encoding-declaration">character encoding declaration</a>,
- an author is less likely to run into this kind of problem.<p>Authors are encouraged to use UTF-8. Conformance checkers may
- advise authors against using legacy encodings.<div class="impl">
-
- <p>Authoring tools should default to using UTF-8 for newly-created
- documents.</p>
-
- </div><p class="note">Using non-UTF-8 encodings can have unexpected
+ <p>Authors should not use UTF-32, as the HTML5 encoding detection
+ algorithms intentionally do not distinguish it from UTF-16. <a href="#refsUNICODE">[UNICODE]</a><p class="note">Using non-UTF-8 encodings can have unexpected
results on form submission and URL encodings, which use the
<a href="#document-s-character-encoding">document's character encoding</a> by default.<p>In XHTML, the XML declaration should be used for inline character
encoding information, if necessary.<div class="example">
Received on Friday, 23 October 2009 22:13:46 UTC
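
Illustration (not part of the commit): the new paragraph warns about encodings in which a run of bytes in the range 0x20 to 0x7E can encode characters other than U+0020 to U+007E. The sketch below shows that with UTF-7, one of the encodings the diff says authors must not use. The payload and the helper name is_printable_ascii are hypothetical, chosen only for this example. It relies on Python's standard "utf-7" codec and is not a description of how any user agent detects encodings.

```python
# Hypothetical payload: every byte lies in the printable ASCII range 0x20-0x7E.
payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"


def is_printable_ascii(data: bytes) -> bool:
    """Return True if every byte is in the range 0x20 to 0x7E."""
    return all(0x20 <= b <= 0x7E for b in data)


# Read as ASCII, the payload is harmless text with no angle brackets.
as_ascii = payload.decode("ascii")

# Read as UTF-7, "+ADw-" and "+AD4-" decode to "<" and ">", so the same
# bytes now spell out a script element.
as_utf7 = payload.decode("utf-7")

print(is_printable_ascii(payload))
print(as_ascii)
print(as_utf7)
```

A filter that treats the bytes as ASCII sees no markup, while a consumer that treats them as UTF-7 sees a script element; that mismatch is the kind of risk the reworded text describes.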