- From: poot <cvsmail@w3.org>
- Date: Fri, 17 Jun 2011 05:55:04 -0400
- To: public-html-diffs@w3.org
hixie: Try to clean up the stuff about Unicode characters. (whatwg r6184)

http://dev.w3.org/cvsweb/html5/spec/Overview.html?r1=1.4956&r2=1.4957&f=h
http://html5.org/tools/web-apps-tracker?from=6183&to=6184

===================================================================
RCS file: /sources/public/html5/spec/Overview.html,v
retrieving revision 1.4956
retrieving revision 1.4957
diff -u -d -r1.4956 -r1.4957
--- Overview.html 3 Jun 2011 00:48:54 -0000 1.4956
+++ Overview.html 3 Jun 2011 19:40:16 -0000 1.4957
@@ -2507,9 +2507,8 @@
 HZ-GB-2312, and variants of ISO-2022, even though it is possible in these encodings for bytes like 0x70 to be part of longer sequences that are unrelated to their interpretation as ASCII. It excludes
- such encodings as UTF-7, UTF-16, GSM03.38, and EBCDIC variants.</p><p>The term <dfn title="">Unicode character</dfn> is used to mean a
- <i title="">Unicode scalar value</i> (i.e. any Unicode code point
- that is not a surrogate code point). <a href="#refsUNICODE">[UNICODE]</a><h3 id="conformance-requirements"><span class="secno">2.2 </span>Conformance requirements</h3><p>All diagrams, examples, and notes in this specification are
+ such encodings as UTF-7, UTF-16, GSM03.38, and EBCDIC variants.</p><p>The term <dfn id="unicode-character">Unicode character</dfn> is used to mean a <i title="">Unicode scalar value</i> (i.e. any Unicode code point that
+ is not a surrogate code point). <a href="#refsUNICODE">[UNICODE]</a><h3 id="conformance-requirements"><span class="secno">2.2 </span>Conformance requirements</h3><p>All diagrams, examples, and notes in this specification are
 non-normative, as are all sections explicitly marked non-normative. Everything else in this specification is normative.<p>The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in the normative parts of this document are to be
@@ -2948,14 +2947,6 @@
 is passed an Infinity or Not-a-Number (NaN) value, a <code><a href="#not_supported_err">NOT_SUPPORTED_ERR</a></code> exception must be raised.</p>
- <p>Except where otherwise specified, if a method has an argument
- of type <code>DOMString</code>, or if an IDL attribute is assigned
- a new value of type <code>DOMString</code>, the user agent must
- <span title="dfn-obtain-unicode">convert the
- <code>DOMString</code> to a sequence of Unicode characters</span>
- to obtain the string on which the algorithms in this specification
- are to operate. <a href="#refsWEBIDL">[WEBIDL]</a></p>
-
 </dd>
 <dt>JavaScript</dt>
@@ -5442,7 +5433,9 @@
 characters as defined by UTF-8.</p> <p>If any percent-encoded octets in that component are not valid
- UTF-8 sequences, then return an error and abort these steps.</p>
+ UTF-8 sequences (e.g. sequences of percent-encoded octets that
+ expand to surrogate code points), then return an error and abort
+ these steps.</p>
 <p>Apply the IDNA ToASCII algorithm to the matching substring, with both the AllowUnassigned and UseSTD3ASCIIRules flags
@@ -13484,11 +13477,11 @@
 <dd>
- <p>The contents of that file, interpreted as string of
- Unicode characters, are the script source.</p>
+ <p>The contents of that file, interpreted as a Unicode
+ string, are the script source.</p>
- <p>To obtain the string of Unicode characters, the user
- agent run the following steps:</p>
+ <p>To obtain the Unicode string, the user agent run the
+ following steps:</p>
 <ol><li><p>If the resource's <a href="#content-type" title="Content-Type">Content Type metadata</a>, if any, specifies a character
@@ -13813,11 +13806,11 @@
 star = %x002A ; U+002A ASTERISK (*)
 slash = %x002F ; U+002F SOLIDUS (/)
 not-newline = %x0000-0009 / %x000B-10FFFF
- ; a Unicode character other than U+000A LINE FEED (LF)
+ ; a <a href="#unicode-character">Unicode character</a> other than U+000A LINE FEED (LF)
 not-star = %x0000-0029 / %x002B-10FFFF
- ; a Unicode character other than U+002A ASTERISK (*)
+ ; a <a href="#unicode-character">Unicode character</a> other than U+002A ASTERISK (*)
 not-slash = %x0000-002E / %x0030-10FFFF
- ; a Unicode character other than U+002F SOLIDUS (/)</pre><p class="note">This corresponds to putting the contents of the
+ ; a <a href="#unicode-character">Unicode character</a> other than U+002F SOLIDUS (/)</pre><p class="note">This corresponds to putting the contents of the
 element in JavaScript comments.<p class="note">This requirement is in addition to the earlier restrictions on the syntax of contents of <code><a href="#the-script-element">script</a></code> elements.<div class="example">
@@ -46033,14 +46026,14 @@
 <li><p>Let <var title="">decoded fragid</var> be the result of expanding any sequences of percent-encoded octets in <var title="">fragid</var> that are valid UTF-8 sequences into Unicode characters as defined by UTF-8. If any percent-encoded octets in
- that string are not valid UTF-8 sequences, then skip this step and
- the next one.</p>
+ that string are not valid UTF-8 sequences (e.g. they expand to
+ surrogate code points), then skip this step and the next one.</p>
 <li><p>If this step was not skipped and there is an element in the
- DOM that has an <a href="#concept-id" title="concept-id">ID</a> exactly equal to <var title="">decoded
- fragid</var>, then the first such element in tree order is
- <a href="#the-indicated-part-of-the-document">the indicated part of the document</a>; stop the algorithm
- here.</li>
+ DOM that has an <a href="#concept-id" title="concept-id">ID</a> exactly equal to
+ <var title="">decoded fragid</var>, then the first such element in
+ tree order is <a href="#the-indicated-part-of-the-document">the indicated part of the document</a>; stop
+ the algorithm here.</li>
 <li><p>If there is an <code><a href="#the-a-element">a</a></code> element in the DOM that has a <code title="attr-a-name"><a href="#attr-a-name">name</a></code> attribute whose value is
@@ -55062,12 +55055,13 @@
 TABULATION (tab), U+000A LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR), U+0020 SPACE, U+003E GREATER-THAN SIGN (>), or U+002F SOLIDUS (/).<h4 id="text-0"><span class="secno">8.1.3 </span>Text</h4><p><dfn id="syntax-text" title="syntax-text">Text</dfn> is allowed inside elements,
- attribute values, and comments. Text must consist of Unicode
- characters. Text must not contain U+0000 characters. Text must not
- contain permanently undefined Unicode characters (noncharacters).
- Text must not contain control characters other than <a href="#space-character" title="space character">space characters</a>. Extra constraints
- are placed on what is and what is not allowed in text based on where
- the text is to be put, as described in the other sections.<h5 id="newlines"><span class="secno">8.1.3.1 </span>Newlines</h5><p><dfn id="syntax-newlines" title="syntax-newlines">Newlines</dfn> in HTML may be
+ attribute values, and comments. Text must consist of <a href="#unicode-character" title="Unicode character">Unicode characters</a>. Text must not
+ contain U+0000 characters. Text must not contain permanently
+ undefined Unicode characters (noncharacters). Text must not contain
+ control characters other than <a href="#space-character" title="space character">space
+ characters</a>. Extra constraints are placed on what is and what
+ is not allowed in text based on where the text is to be put, as
+ described in the other sections.<h5 id="newlines"><span class="secno">8.1.3.1 </span>Newlines</h5><p><dfn id="syntax-newlines" title="syntax-newlines">Newlines</dfn> in HTML may be
 represented either as U+000D CARRIAGE RETURN (CR) characters, U+000A LINE FEED (LF) characters, or pairs of U+000D CARRIAGE RETURN (CR), U+000A LINE FEED (LF) characters in that order.<p>Where <a href="#syntax-charref" title="syntax-charref">character references</a>
@@ -55226,7 +55220,7 @@
 <h4 id="overview-of-the-parsing-model"><span class="secno">8.2.1 </span>Overview of the parsing model</h4> <p>The input to the HTML parsing process consists of a stream of
- Unicode characters, which is passed through a
+ Unicode code points, which is passed through a
 <a href="#tokenization">tokenization</a> stage followed by a <a href="#tree-construction">tree construction</a> stage. The output is a <code><a href="#document">Document</a></code> object.</p>
@@ -55272,7 +55266,7 @@
 <h4 id="the-input-stream"><span class="secno">8.2.2 </span>The <dfn>input stream</dfn></h4>
- <p>The stream of Unicode characters that comprises the input to the
+ <p>The stream of Unicode code points that comprises the input to the
 tokenization stage will be initially seen by the user agent as a stream of bytes (typically coming over the network or from the local file system). The bytes encode the actual characters according to a
@@ -55310,8 +55304,8 @@
 that encoding is <i>tentative</i> or <i>certain</i>, is <a href="#meta-charset-during-parse">used during the parsing</a> to determine whether to <a href="#change-the-encoding">change the encoding</a>. If no encoding is necessary, e.g. because the parser is operating on a
- stream of Unicode characters and doesn't have to use an encoding at
- all, then the <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a> is
+ Unicode stream and doesn't have to use an encoding at all, then the
+ <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a> is
 <i>irrelevant</i>.</p> <ol><li><p>If the user has explicitly instructed the user agent to
@@ -55916,7 +55910,7 @@
 <h5 id="preprocessing-the-input-stream"><span class="secno">8.2.2.3 </span>Preprocessing the input stream</h5> <p>Given an encoding, the bytes in the input stream must be
- converted to Unicode characters for the tokenizer, as described by
+ converted to Unicode code points for the tokenizer, as described by
 the rules for that encoding, except that the leading U+FEFF BYTE ORDER MARK character, if any, must not be stripped by the encoding layer (it is stripped by the rule below).</p>
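
Reader's note, not part of the commit: the distinction this change relies on (a "Unicode character" is a scalar value, i.e. any code point that is not a surrogate, and percent-encoded octets that expand to surrogates are not valid UTF-8) can be illustrated with a minimal Python sketch. The function names `is_unicode_scalar_value` and `decode_percent_utf8` are hypothetical, and the sketch assumes an ASCII input string; it only approximates the spec steps quoted in the diff.

```python
from typing import Optional
from urllib.parse import unquote_to_bytes


def is_unicode_scalar_value(cp: int) -> bool:
    """Return True if cp is a code point in U+0000..U+10FFFF that is
    not a surrogate (U+D800..U+DFFF)."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def decode_percent_utf8(component: str) -> Optional[str]:
    """Expand percent-encoded octets and decode them as strict UTF-8.

    Returns None (standing in for "return an error") when the octets are
    not valid UTF-8, which includes octets that would expand to
    surrogate code points. Assumes `component` is ASCII.
    """
    raw = unquote_to_bytes(component)
    try:
        # Python's strict UTF-8 decoder rejects encoded surrogates.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    print(is_unicode_scalar_value(0x41))     # ordinary code point
    print(is_unicode_scalar_value(0xD800))   # surrogate code point
    print(decode_percent_utf8("caf%C3%A9"))  # valid UTF-8
    print(decode_percent_utf8("%ED%A0%80"))  # would expand to U+D800
```

The parser-facing wording in the diff moves the other way ("Unicode code points" rather than "Unicode characters"), so a check like `is_unicode_scalar_value` would apply to text conformance and URL decoding, not to what the tokenizer is allowed to receive.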
Received on Friday, 17 June 2011 09:55:06 UTC