hixie: Try to clean up the stuff about Unicode characters. (whatwg r6184) from poot on 2011-06-17 (public-html-diffs@w3.org from June 2011)

From: poot <cvsmail@w3.org>
Date: Fri, 17 Jun 2011 05:55:04 -0400
To: public-html-diffs@w3.org
Message-Id: <E1QXVlY-0004yC-5p@jay.w3.org>

http://dev.w3.org/cvsweb/html5/spec/Overview.html?r1=1.4956&r2=1.4957&f=h
http://html5.org/tools/web-apps-tracker?from=6183&to=6184

===================================================================
RCS file: /sources/public/html5/spec/Overview.html,v
retrieving revision 1.4956
retrieving revision 1.4957
diff -u -d -r1.4956 -r1.4957
--- Overview.html 3 Jun 2011 00:48:54 -0000 1.4956
+++ Overview.html 3 Jun 2011 19:40:16 -0000 1.4957
@@ -2507,9 +2507,8 @@
   HZ-GB-2312, and variants of ISO-2022, even though it is possible in
   these encodings for bytes like 0x70 to be part of longer sequences
   that are unrelated to their interpretation as ASCII. It excludes
-  such encodings as UTF-7, UTF-16, GSM03.38, and EBCDIC variants.</p><p>The term <dfn title="">Unicode character</dfn> is used to mean a
-  <i title="">Unicode scalar value</i> (i.e. any Unicode code point
-  that is not a surrogate code point). <a href="#refsUNICODE">[UNICODE]</a><h3 id="conformance-requirements"><span class="secno">2.2 </span>Conformance requirements</h3><p>All diagrams, examples, and notes in this specification are
+  such encodings as UTF-7, UTF-16, GSM03.38, and EBCDIC variants.</p><p>The term <dfn id="unicode-character">Unicode character</dfn> is used to mean a <i title="">Unicode scalar value</i> (i.e. any Unicode code point that
+  is not a surrogate code point). <a href="#refsUNICODE">[UNICODE]</a><h3 id="conformance-requirements"><span class="secno">2.2 </span>Conformance requirements</h3><p>All diagrams, examples, and notes in this specification are
   non-normative, as are all sections explicitly marked non-normative.
   Everything else in this specification is normative.<p>The key words "MUST", "MUST NOT", "REQUIRED",  "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
   "OPTIONAL" in the normative parts of this document are to be
@@ -2948,14 +2947,6 @@
     is passed an Infinity or Not-a-Number (NaN) value, a
     <code><a href="#not_supported_err">NOT_SUPPORTED_ERR</a></code> exception must be raised.</p>
 
-    <p>Except where otherwise specified, if a method has an argument
-    of type <code>DOMString</code>, or if an IDL attribute is assigned
-    a new value of type <code>DOMString</code>, the user agent must
-    <span title="dfn-obtain-unicode">convert the
-    <code>DOMString</code> to a sequence of Unicode characters</span>
-    to obtain the string on which the algorithms in this specification
-    are to operate. <a href="#refsWEBIDL">[WEBIDL]</a></p>
-
    </dd>
 
    <dt>JavaScript</dt>
@@ -5442,7 +5433,9 @@
     characters as defined by UTF-8.</p>
 
     <p>If any percent-encoded octets in that component are not valid
-    UTF-8 sequences, then return an error and abort these steps.</p>
+    UTF-8 sequences (e.g. sequences of percent-encoded octets that
+    expand to surrogate code points), then return an error and abort
+    these steps.</p>
 
     <p>Apply the IDNA ToASCII algorithm to the matching substring,
     with both the AllowUnassigned and UseSTD3ASCIIRules flags
@@ -13484,11 +13477,11 @@
 
          <dd>
 
-          <p>The contents of that file, interpreted as string of
-          Unicode characters, are the script source.</p>
+          <p>The contents of that file, interpreted as a Unicode
+          string, are the script source.</p>
 
-          <p>To obtain the string of Unicode characters, the user
-          agent run the following steps:</p>
+          <p>To obtain the Unicode string, the user agent run the
+          following steps:</p>
 
           <ol><li><p>If the resource's <a href="#content-type" title="Content-Type">Content
            Type metadata</a>, if any, specifies a character
@@ -13813,11 +13806,11 @@
 star          = %x002A ; U+002A ASTERISK (*)
 slash         = %x002F ; U+002F SOLIDUS (/)
 not-newline   = %x0000-0009 / %x000B-10FFFF
-                ; a Unicode character other than U+000A LINE FEED (LF)
+                ; a <a href="#unicode-character">Unicode character</a> other than U+000A LINE FEED (LF)
 not-star      = %x0000-0029 / %x002B-10FFFF
-                ; a Unicode character other than U+002A ASTERISK (*)
+                ; a <a href="#unicode-character">Unicode character</a> other than U+002A ASTERISK (*)
 not-slash     = %x0000-002E / %x0030-10FFFF
-                ; a Unicode character other than U+002F SOLIDUS (/)</pre><p class="note">This corresponds to putting the contents of the
+                ; a <a href="#unicode-character">Unicode character</a> other than U+002F SOLIDUS (/)</pre><p class="note">This corresponds to putting the contents of the
   element in JavaScript comments.<p class="note">This requirement is in addition to the earlier
   restrictions on the syntax of contents of <code><a href="#the-script-element">script</a></code>
   elements.<div class="example">
@@ -46033,14 +46026,14 @@
    <li><p>Let <var title="">decoded fragid</var> be the result of
    expanding any sequences of percent-encoded octets in <var title="">fragid</var> that are valid UTF-8 sequences into Unicode
    characters as defined by UTF-8. If any percent-encoded octets in
-   that string are not valid UTF-8 sequences, then skip this step and
-   the next one.</p>
+   that string are not valid UTF-8 sequences (e.g. they expand to
+   surrogate code points), then skip this step and the next one.</p>
 
    <li><p>If this step was not skipped and there is an element in the
-   DOM that has an <a href="#concept-id" title="concept-id">ID</a> exactly equal to <var title="">decoded
-   fragid</var>, then the first such element in tree order is
-   <a href="#the-indicated-part-of-the-document">the indicated part of the document</a>; stop the algorithm
-   here.</li>
+   DOM that has an <a href="#concept-id" title="concept-id">ID</a> exactly equal to
+   <var title="">decoded fragid</var>, then the first such element in
+   tree order is <a href="#the-indicated-part-of-the-document">the indicated part of the document</a>; stop
+   the algorithm here.</li>
 
    <li><p>If there is an <code><a href="#the-a-element">a</a></code> element in the DOM that has a
    <code title="attr-a-name"><a href="#attr-a-name">name</a></code> attribute whose value is
@@ -55062,12 +55055,13 @@
   TABULATION (tab), U+000A LINE FEED (LF), U+000C FORM FEED (FF), U+000D
   CARRIAGE RETURN (CR), U+0020 SPACE, U+003E GREATER-THAN SIGN (&gt;), or
   U+002F SOLIDUS (/).<h4 id="text-0"><span class="secno">8.1.3 </span>Text</h4><p><dfn id="syntax-text" title="syntax-text">Text</dfn> is allowed inside elements,
-  attribute values, and comments. Text must consist of Unicode
-  characters. Text must not contain U+0000 characters. Text must not
-  contain permanently undefined Unicode characters (noncharacters).
-  Text must not contain control characters other than <a href="#space-character" title="space character">space characters</a>. Extra constraints
-  are placed on what is and what is not allowed in text based on where
-  the text is to be put, as described in the other sections.<h5 id="newlines"><span class="secno">8.1.3.1 </span>Newlines</h5><p><dfn id="syntax-newlines" title="syntax-newlines">Newlines</dfn> in HTML may be
+  attribute values, and comments. Text must consist of <a href="#unicode-character" title="Unicode character">Unicode characters</a>. Text must not
+  contain U+0000 characters. Text must not contain permanently
+  undefined Unicode characters (noncharacters). Text must not contain
+  control characters other than <a href="#space-character" title="space character">space
+  characters</a>. Extra constraints are placed on what is and what
+  is not allowed in text based on where the text is to be put, as
+  described in the other sections.<h5 id="newlines"><span class="secno">8.1.3.1 </span>Newlines</h5><p><dfn id="syntax-newlines" title="syntax-newlines">Newlines</dfn> in HTML may be
   represented either as U+000D CARRIAGE RETURN (CR) characters, U+000A
   LINE FEED (LF) characters, or pairs of U+000D CARRIAGE RETURN (CR),
   U+000A LINE FEED (LF) characters in that order.<p>Where <a href="#syntax-charref" title="syntax-charref">character references</a>
@@ -55226,7 +55220,7 @@
   <h4 id="overview-of-the-parsing-model"><span class="secno">8.2.1 </span>Overview of the parsing model</h4>
 
   <p>The input to the HTML parsing process consists of a stream of
-  Unicode characters, which is passed through a
+  Unicode code points, which is passed through a
   <a href="#tokenization">tokenization</a> stage followed by a <a href="#tree-construction">tree
   construction</a> stage. The output is a <code><a href="#document">Document</a></code>
   object.</p>
@@ -55272,7 +55266,7 @@
 
   <h4 id="the-input-stream"><span class="secno">8.2.2 </span>The <dfn>input stream</dfn></h4>
 
-  <p>The stream of Unicode characters that comprises the input to the
+  <p>The stream of Unicode code points that comprises the input to the
   tokenization stage will be initially seen by the user agent as a
   stream of bytes (typically coming over the network or from the local
   file system). The bytes encode the actual characters according to a
@@ -55310,8 +55304,8 @@
   that encoding is <i>tentative</i> or <i>certain</i>, is <a href="#meta-charset-during-parse">used during the parsing</a> to
   determine whether to <a href="#change-the-encoding">change the encoding</a>. If no
   encoding is necessary, e.g. because the parser is operating on a
-  stream of Unicode characters and doesn't have to use an encoding at
-  all, then the <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a> is
+  Unicode stream and doesn't have to use an encoding at all, then the
+  <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a> is
   <i>irrelevant</i>.</p>
 
   <ol><li><p>If the user has explicitly instructed the user agent to
@@ -55916,7 +55910,7 @@
   <h5 id="preprocessing-the-input-stream"><span class="secno">8.2.2.3 </span>Preprocessing the input stream</h5>
 
   <p>Given an encoding, the bytes in the input stream must be
-  converted to Unicode characters for the tokenizer, as described by
+  converted to Unicode code points for the tokenizer, as described by
   the rules for that encoding, except that the leading U+FEFF BYTE
   ORDER MARK character, if any, must not be stripped by the encoding
   layer (it is stripped by the rule below).</p>
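
For readers of this diff, here is a minimal Python sketch of the two ideas the change keeps returning to: a "Unicode character" in the spec's sense is a scalar value (any code point that is not a surrogate), and percent-encoded octets that would expand to surrogate code points are not valid UTF-8. It is not part of the commit or the spec. The function names `is_scalar_value` and `decode_percent_utf8` are hypothetical, and the second function simplifies the spec's steps by treating any invalid sequence as an error for the whole component.

```python
from urllib.parse import unquote_to_bytes


def is_scalar_value(cp: int) -> bool:
    """Return True if cp is a Unicode code point that is not a surrogate."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def decode_percent_utf8(component: str):
    """Expand percent-encoded octets and decode them as strict UTF-8.

    Returns the decoded string, or None if the octets are not valid
    UTF-8 (for example, if they would expand to a surrogate code point).
    This is an illustrative simplification, not the spec's algorithm.
    """
    try:
        # Python's strict UTF-8 decoder rejects encoded surrogates.
        return unquote_to_bytes(component).decode("utf-8")
    except UnicodeDecodeError:
        return None
```

Under these assumptions, `decode_percent_utf8("%ED%A0%80")` is rejected because the three octets are the would-be UTF-8 form of U+D800, a surrogate, whereas `"caf%C3%A9"` decodes normally.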
Received on Friday, 17 June 2011 09:55:06 UTC