hixie: Try to tidy up some more of the Unicode/code unit mess with a probably over-reaching definition (there's over 2000 uses of the word 'character' in the text, so I didn't check that all of them use this new definition... hopefully it works out; otherwise, we'll just have to try something else again). (whatwg r6648)

http://dev.w3.org/cvsweb/html5/spec/Overview.html?r1=1.5329&r2=1.5330&f=h
http://html5.org/tools/web-apps-tracker?from=6647&to=6648

===================================================================
RCS file: /sources/public/html5/spec/Overview.html,v
retrieving revision 1.5329
retrieving revision 1.5330
diff -u -d -r1.5329 -r1.5330
--- Overview.html 6 Oct 2011 06:35:53 -0000 1.5329
+++ Overview.html 6 Oct 2011 23:27:46 -0000 1.5330
@@ -2713,7 +2713,17 @@
   such encodings as UTF-7, UTF-16, GSM03.38, and EBCDIC variants.</p><p>The term <dfn id="a-utf-16-encoding">a UTF-16 encoding</dfn> refers to any variant of
   UTF-16: self-describing UTF-16 with a BOM, ambiguous UTF-16 without
   a BOM, raw UTF-16LE, and raw UTF-16BE. <a href="#refsRFC2781">[RFC2781]</a><p>The term <dfn id="unicode-character">Unicode character</dfn> is used to mean a <i title="">Unicode scalar value</i> (i.e. any Unicode code point that
-  is not a surrogate code point). <a href="#refsUNICODE">[UNICODE]</a><h3 id="conformance-requirements"><span class="secno">2.2 </span>Conformance requirements</h3><p>All diagrams, examples, and notes in this specification are
+  is not a surrogate code point). <a href="#refsUNICODE">[UNICODE]</a><p>The term <dfn id="character">character</dfn>, when not qualified as
+  <em>Unicode</em> character, means a <a href="#unicode-character">Unicode character</a>
+  where possible, or a surrogate code point when not: when an
+  algorithm that processes strings is defined in terms of characters,
+  a pair of <span title="code unit">code units</span> consisting of a
+  high surrogate followed by a low surrogate must be treated as a
+  single character, but isolated surrogates must each be treated as a
+  single character also.<p>The <dfn id="code-point-length">code-point length</dfn> of a string is the number of
+  <span title="code unit">code units</span> in that string. <a href="#refsWEBIDL">[WEBIDL]</a><p class="note">This complexity results from the historical decision
+  to define the DOM API in terms of 16 bit (UTF-16) <span title="code
+  unit">code units</span>, rather than in terms of <a href="#unicode-character" title="Unicode character">Unicode characters</a>.<h3 id="conformance-requirements"><span class="secno">2.2 </span>Conformance requirements</h3><p>All diagrams, examples, and notes in this specification are
   non-normative, as are all sections explicitly marked non-normative.
   Everything else in this specification is normative.<p>The key words "MUST", "MUST NOT", "REQUIRED",  "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
   "OPTIONAL" in the normative parts of this document are to be
@@ -3703,9 +3713,6 @@
   whitespace</dfn> from a string, the user agent must remove all <a href="#space-character" title="space character">space characters</a> that are at the
   start or end of the string.</p>
 
-  <p>The <dfn id="code-point-length">code-point length</dfn> of a string is the number of
-  <span title="code unit">code units</span> in that string. <a href="#refsWEBIDL">[WEBIDL]</a></p>
-
   <p>When a user agent has to <dfn id="strictly-split-a-string">strictly split a string</dfn> on a
   particular delimiter character <var title="">delimiter</var>, it
   must use the following algorithm:</p>
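
To illustrate the "character" definition added in the first hunk, here is a
minimal Python sketch, not taken from the spec. It assumes a string is
represented as a list of 16-bit code units; the helper names (to_code_units,
split_characters, code_point_length) are hypothetical and chosen only for this
example. A high surrogate followed by a low surrogate is grouped into one
character, and an isolated surrogate counts as one character on its own. Note
that, as worded in this revision, the "code-point length" is the number of
code units, not the number of characters.

```python
from typing import List, Tuple


def to_code_units(s: str) -> List[int]:
    """Encode a Python string as a list of 16-bit UTF-16 code units.

    'surrogatepass' lets lone surrogates through so they can be inspected.
    """
    data = s.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def is_high_surrogate(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit: int) -> bool:
    return 0xDC00 <= unit <= 0xDFFF


def split_characters(units: List[int]) -> List[Tuple[int, ...]]:
    """Group code units into characters as described in the diff:
    a high surrogate followed by a low surrogate is a single character,
    and every other code unit (including an isolated surrogate) is a
    single character by itself.
    """
    chars: List[Tuple[int, ...]] = []
    i = 0
    while i < len(units):
        unit = units[i]
        if (is_high_surrogate(unit)
                and i + 1 < len(units)
                and is_low_surrogate(units[i + 1])):
            chars.append((unit, units[i + 1]))
            i += 2
        else:
            chars.append((unit,))
            i += 1
    return chars


def code_point_length(units: List[int]) -> int:
    """The 'code-point length' as defined in this revision: the number of code units."""
    return len(units)
```

For example, "a" followed by U+1F600 is three code units but two characters
under this definition, while a low surrogate followed by a high surrogate
(the wrong order for a pair) is two code units and two characters.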

Received on Thursday, 6 October 2011 23:28:09 UTC