- From: Ian Hickson via cvs-syncmail <cvsmail@w3.org>
- Date: Fri, 03 Jun 2011 19:40:20 +0000
- To: public-html-commits@w3.org
Update of /sources/public/html5/spec
In directory hutz:/tmp/cvs-serv31625
Modified Files:
Overview.html
Log Message:
Try to clean up the stuff about Unicode characters. (whatwg r6184)
Index: Overview.html
===================================================================
RCS file: /sources/public/html5/spec/Overview.html,v
retrieving revision 1.4956
retrieving revision 1.4957
diff -u -d -r1.4956 -r1.4957
--- Overview.html 3 Jun 2011 00:48:54 -0000 1.4956
+++ Overview.html 3 Jun 2011 19:40:16 -0000 1.4957
@@ -2507,9 +2507,8 @@
HZ-GB-2312, and variants of ISO-2022, even though it is possible in
these encodings for bytes like 0x70 to be part of longer sequences
that are unrelated to their interpretation as ASCII. It excludes
- such encodings as UTF-7, UTF-16, GSM03.38, and EBCDIC variants.</p><p>The term <dfn title="">Unicode character</dfn> is used to mean a
- <i title="">Unicode scalar value</i> (i.e. any Unicode code point
- that is not a surrogate code point). <a href="#refsUNICODE">[UNICODE]</a><h3 id="conformance-requirements"><span class="secno">2.2 </span>Conformance requirements</h3><p>All diagrams, examples, and notes in this specification are
+ such encodings as UTF-7, UTF-16, GSM03.38, and EBCDIC variants.</p><p>The term <dfn id="unicode-character">Unicode character</dfn> is used to mean a <i title="">Unicode scalar value</i> (i.e. any Unicode code point that
+ is not a surrogate code point). <a href="#refsUNICODE">[UNICODE]</a><h3 id="conformance-requirements"><span class="secno">2.2 </span>Conformance requirements</h3><p>All diagrams, examples, and notes in this specification are
non-normative, as are all sections explicitly marked non-normative.
Everything else in this specification is normative.<p>The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
"OPTIONAL" in the normative parts of this document are to be
@@ -2948,14 +2947,6 @@
is passed an Infinity or Not-a-Number (NaN) value, a
<code><a href="#not_supported_err">NOT_SUPPORTED_ERR</a></code> exception must be raised.</p>
- <p>Except where otherwise specified, if a method has an argument
- of type <code>DOMString</code>, or if an IDL attribute is assigned
- a new value of type <code>DOMString</code>, the user agent must
- <span title="dfn-obtain-unicode">convert the
- <code>DOMString</code> to a sequence of Unicode characters</span>
- to obtain the string on which the algorithms in this specification
- are to operate. <a href="#refsWEBIDL">[WEBIDL]</a></p>
-
</dd>
<dt>JavaScript</dt>
@@ -5442,7 +5433,9 @@
characters as defined by UTF-8.</p>
<p>If any percent-encoded octets in that component are not valid
- UTF-8 sequences, then return an error and abort these steps.</p>
+ UTF-8 sequences (e.g. sequences of percent-encoded octets that
+ expand to surrogate code points), then return an error and abort
+ these steps.</p>
<p>Apply the IDNA ToASCII algorithm to the matching substring,
with both the AllowUnassigned and UseSTD3ASCIIRules flags
@@ -13484,11 +13477,11 @@
<dd>
- <p>The contents of that file, interpreted as string of
- Unicode characters, are the script source.</p>
+ <p>The contents of that file, interpreted as a Unicode
+ string, are the script source.</p>
- <p>To obtain the string of Unicode characters, the user
- agent run the following steps:</p>
+ <p>To obtain the Unicode string, the user agent run the
+ following steps:</p>
<ol><li><p>If the resource's <a href="#content-type" title="Content-Type">Content
Type metadata</a>, if any, specifies a character
@@ -13813,11 +13806,11 @@
star = %x002A ; U+002A ASTERISK (*)
slash = %x002F ; U+002F SOLIDUS (/)
not-newline = %x0000-0009 / %x000B-10FFFF
- ; a Unicode character other than U+000A LINE FEED (LF)
+ ; a <a href="#unicode-character">Unicode character</a> other than U+000A LINE FEED (LF)
not-star = %x0000-0029 / %x002B-10FFFF
- ; a Unicode character other than U+002A ASTERISK (*)
+ ; a <a href="#unicode-character">Unicode character</a> other than U+002A ASTERISK (*)
not-slash = %x0000-002E / %x0030-10FFFF
- ; a Unicode character other than U+002F SOLIDUS (/)</pre><p class="note">This corresponds to putting the contents of the
+ ; a <a href="#unicode-character">Unicode character</a> other than U+002F SOLIDUS (/)</pre><p class="note">This corresponds to putting the contents of the
element in JavaScript comments.<p class="note">This requirement is in addition to the earlier
restrictions on the syntax of contents of <code><a href="#the-script-element">script</a></code>
elements.<div class="example">
@@ -46033,14 +46026,14 @@
<li><p>Let <var title="">decoded fragid</var> be the result of
expanding any sequences of percent-encoded octets in <var title="">fragid</var> that are valid UTF-8 sequences into Unicode
characters as defined by UTF-8. If any percent-encoded octets in
- that string are not valid UTF-8 sequences, then skip this step and
- the next one.</p>
+ that string are not valid UTF-8 sequences (e.g. they expand to
+ surrogate code points), then skip this step and the next one.</p>
<li><p>If this step was not skipped and there is an element in the
- DOM that has an <a href="#concept-id" title="concept-id">ID</a> exactly equal to <var title="">decoded
- fragid</var>, then the first such element in tree order is
- <a href="#the-indicated-part-of-the-document">the indicated part of the document</a>; stop the algorithm
- here.</li>
+ DOM that has an <a href="#concept-id" title="concept-id">ID</a> exactly equal to
+ <var title="">decoded fragid</var>, then the first such element in
+ tree order is <a href="#the-indicated-part-of-the-document">the indicated part of the document</a>; stop
+ the algorithm here.</li>
<li><p>If there is an <code><a href="#the-a-element">a</a></code> element in the DOM that has a
<code title="attr-a-name"><a href="#attr-a-name">name</a></code> attribute whose value is
@@ -55062,12 +55055,13 @@
TABULATION (tab), U+000A LINE FEED (LF), U+000C FORM FEED (FF), U+000D
CARRIAGE RETURN (CR), U+0020 SPACE, U+003E GREATER-THAN SIGN (>), or
U+002F SOLIDUS (/).<h4 id="text-0"><span class="secno">8.1.3 </span>Text</h4><p><dfn id="syntax-text" title="syntax-text">Text</dfn> is allowed inside elements,
- attribute values, and comments. Text must consist of Unicode
- characters. Text must not contain U+0000 characters. Text must not
- contain permanently undefined Unicode characters (noncharacters).
- Text must not contain control characters other than <a href="#space-character" title="space character">space characters</a>. Extra constraints
- are placed on what is and what is not allowed in text based on where
- the text is to be put, as described in the other sections.<h5 id="newlines"><span class="secno">8.1.3.1 </span>Newlines</h5><p><dfn id="syntax-newlines" title="syntax-newlines">Newlines</dfn> in HTML may be
+ attribute values, and comments. Text must consist of <a href="#unicode-character" title="Unicode character">Unicode characters</a>. Text must not
+ contain U+0000 characters. Text must not contain permanently
+ undefined Unicode characters (noncharacters). Text must not contain
+ control characters other than <a href="#space-character" title="space character">space
+ characters</a>. Extra constraints are placed on what is and what
+ is not allowed in text based on where the text is to be put, as
+ described in the other sections.<h5 id="newlines"><span class="secno">8.1.3.1 </span>Newlines</h5><p><dfn id="syntax-newlines" title="syntax-newlines">Newlines</dfn> in HTML may be
represented either as U+000D CARRIAGE RETURN (CR) characters, U+000A
LINE FEED (LF) characters, or pairs of U+000D CARRIAGE RETURN (CR),
U+000A LINE FEED (LF) characters in that order.<p>Where <a href="#syntax-charref" title="syntax-charref">character references</a>
@@ -55226,7 +55220,7 @@
<h4 id="overview-of-the-parsing-model"><span class="secno">8.2.1 </span>Overview of the parsing model</h4>
<p>The input to the HTML parsing process consists of a stream of
- Unicode characters, which is passed through a
+ Unicode code points, which is passed through a
<a href="#tokenization">tokenization</a> stage followed by a <a href="#tree-construction">tree
construction</a> stage. The output is a <code><a href="#document">Document</a></code>
object.</p>
@@ -55272,7 +55266,7 @@
<h4 id="the-input-stream"><span class="secno">8.2.2 </span>The <dfn>input stream</dfn></h4>
- <p>The stream of Unicode characters that comprises the input to the
+ <p>The stream of Unicode code points that comprises the input to the
tokenization stage will be initially seen by the user agent as a
stream of bytes (typically coming over the network or from the local
file system). The bytes encode the actual characters according to a
@@ -55310,8 +55304,8 @@
that encoding is <i>tentative</i> or <i>certain</i>, is <a href="#meta-charset-during-parse">used during the parsing</a> to
determine whether to <a href="#change-the-encoding">change the encoding</a>. If no
encoding is necessary, e.g. because the parser is operating on a
- stream of Unicode characters and doesn't have to use an encoding at
- all, then the <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a> is
+ Unicode stream and doesn't have to use an encoding at all, then the
+ <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a> is
<i>irrelevant</i>.</p>
<ol><li><p>If the user has explicitly instructed the user agent to
@@ -55916,7 +55910,7 @@
<h5 id="preprocessing-the-input-stream"><span class="secno">8.2.2.3 </span>Preprocessing the input stream</h5>
<p>Given an encoding, the bytes in the input stream must be
- converted to Unicode characters for the tokenizer, as described by
+ converted to Unicode code points for the tokenizer, as described by
the rules for that encoding, except that the leading U+FEFF BYTE
ORDER MARK character, if any, must not be stripped by the encoding
layer (it is stripped by the rule below).</p>
Received on Friday, 3 June 2011 19:40:22 UTC
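
The distinction this commit draws between a Unicode code point and a "Unicode character" (a Unicode scalar value, i.e. any code point that is not a surrogate) can be sketched in a few lines of Python. This is only an illustration, not text from the spec: the function names `is_surrogate`, `is_unicode_character`, and `decode_percent_encoded` are hypothetical, and the percent-decoding helper is a simplification that treats the whole component as one UTF-8 byte string rather than following the spec's algorithm step by step. It relies on Python's strict UTF-8 decoder rejecting byte sequences that would expand to surrogate code points, which matches the example added in the diff.

```python
from typing import Optional
from urllib.parse import unquote_to_bytes


def is_surrogate(code_point: int) -> bool:
    """True for code points in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= code_point <= 0xDFFF


def is_unicode_character(code_point: int) -> bool:
    """True for a Unicode scalar value: a code point that is not a surrogate."""
    return 0 <= code_point <= 0x10FFFF and not is_surrogate(code_point)


def decode_percent_encoded(component: str) -> Optional[str]:
    """Expand percent-encoded octets as UTF-8 (simplified sketch).

    Returns None when the octets are not valid UTF-8, for example when
    they would expand to a surrogate code point.
    """
    try:
        return unquote_to_bytes(component).decode("utf-8")
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    print(is_unicode_character(0x0041))         # LATIN CAPITAL LETTER A
    print(is_unicode_character(0xD800))         # a surrogate code point
    print(decode_percent_encoded("caf%C3%A9"))  # valid UTF-8
    print(decode_percent_encoded("%ED%A0%80"))  # would expand to U+D800
```

Under these assumptions, a parser input stream described as "Unicode code points" may carry values for which `is_unicode_character` is false, while text required to consist of "Unicode characters" may not.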