- From: Ian Hickson via cvs-syncmail <cvsmail@w3.org>
- Date: Mon, 13 Feb 2012 22:48:23 +0000
- To: public-html-commits@w3.org
Update of /sources/public/html5/spec
In directory hutz:/tmp/cvs-serv15992

Modified Files:
  Overview.html
Log Message:
Rejig the wording of the character encoding section to make it more precise and in particular to not make CR processing require look-ahead. (whatwg r6991)

Index: Overview.html
===================================================================
RCS file: /sources/public/html5/spec/Overview.html,v
retrieving revision 1.5583
retrieving revision 1.5584
diff -u -d -r1.5583 -r1.5584
--- Overview.html 13 Feb 2012 21:07:05 -0000 1.5583
+++ Overview.html 13 Feb 2012 22:48:18 -0000 1.5584
@@ -1153,7 +1153,7 @@
  <li><a href="#parsing"><span class="secno">8.2 </span>Parsing HTML documents</a>
  <ol>
  <li><a href="#overview-of-the-parsing-model"><span class="secno">8.2.1 </span>Overview of the parsing model</a></li>
- <li><a href="#the-input-stream"><span class="secno">8.2.2 </span>The input stream</a>
+ <li><a href="#the-input-byte-stream"><span class="secno">8.2.2 </span>The input byte stream</a>
  <ol>
  <li><a href="#determining-the-character-encoding"><span class="secno">8.2.2.1 </span>Determining the character encoding</a></li>
  <li><a href="#character-encodings-0"><span class="secno">8.2.2.2 </span>Character encodings</a></li>
@@ -11354,7 +11354,7 @@
  <p>If the document has an <a href="#active-parser">active parser</a> that isn't
  a <a href="#script-created-parser">script-created parser</a>, and the <a href="#insertion-point">insertion
- point</a> associated with that parser's <a href="#the-input-stream">input
+ point</a> associated with that parser's <a href="#input-stream">input
  stream</a> is not undefined (that is, it <em>does</em> point to
  somewhere in the input stream), then the method does nothing. Abort
  these steps and return the <code><a href="#document">Document</a></code>
@@ -11488,7 +11488,7 @@
  entry.</li>
  <li><p>Finally, set the <a href="#insertion-point">insertion point</a> to point at
- just before the end of the <a href="#the-input-stream">input stream</a> (which at this
+ just before the end of the <a href="#input-stream">input stream</a> (which at this
  point will be empty).</li>
  <li><p>Return the <code><a href="#document">Document</a></code> on which the method was
@@ -11532,7 +11532,7 @@
  with the document, then abort these steps.</li>
  <li><p>Insert an <a href="#explicit-eof-character">explicit "EOF" character</a> at the end
- of the parser's <a href="#the-input-stream">input stream</a>.</li>
+ of the parser's <a href="#input-stream">input stream</a>.</li>
  <li><p>If there is a <a href="#pending-parsing-blocking-script">pending parsing-blocking script</a>, then
  abort these steps.</li>
@@ -11603,14 +11603,14 @@
  the user <a href="#refused-to-allow-the-document-to-be-unloaded">refused to allow the document to be
  unloaded</a>, then abort these steps. Otherwise, the
  <a href="#insertion-point">insertion point</a> will point at just before the end of
- the (empty) <a href="#the-input-stream">input stream</a>.</p>
+ the (empty) <a href="#input-stream">input stream</a>.</p>
  </li>
  <li>
  <p>Insert the string consisting of the concatenation of all the
- arguments to the method into the <a href="#the-input-stream">input stream</a> just
+ arguments to the method into the <a href="#input-stream">input stream</a> just
  before the <a href="#insertion-point">insertion point</a>.</p>
  </li>
@@ -48587,12 +48587,12 @@
  an <a href="#html-documents" title="HTML documents">HTML document</a>, set its <a href="#concept-document-content-type" title="concept-document-content-type">content type</a> to
  "<code title="">text/html</code>", create an <a href="#html-parser">HTML parser</a>,
  and associate it with the document. Each <a href="#concept-task" title="concept-task">task</a> that the <a href="#networking-task-source">networking task source</a>
  places on the <a href="#task-queue">task queue</a> while the <a href="#fetch" title="fetch">fetching algorithm</a> runs must then fill the
- parser's <a href="#the-input-stream">input stream</a> with the fetched bytes and cause
- the <a href="#html-parser">HTML parser</a> to perform the appropriate processing
- of the input stream.</p>
+ parser's <a href="#the-input-byte-stream">input byte stream</a> with the fetched bytes and
+ cause the <a href="#html-parser">HTML parser</a> to perform the appropriate
+ processing of the input stream.</p>
- <p class="note">The <a href="#the-input-stream">input stream</a> converts bytes into
- characters for use in the <a href="#tokenization" title="tokenization">tokenizer</a>. This process relies, in part,
+ <p class="note">The <a href="#the-input-byte-stream">input byte stream</a> converts bytes
+ into characters for use in the <a href="#tokenization" title="tokenization">tokenizer</a>. This process relies, in part,
  on character encoding information found in the real <a href="#content-type" title="Content-Type">Content-Type metadata</a> of the
  resource; the "sniffed type" is not used for this purpose.</p>
@@ -48689,9 +48689,9 @@
  state</a>. Each <a href="#concept-task" title="concept-task">task</a> that the <a href="#networking-task-source">networking task source</a>
  places on the <a href="#task-queue">task queue</a> while the <a href="#fetch" title="fetch">fetching algorithm</a>
- runs must then fill the parser's <a href="#the-input-stream">input stream</a> with the
- fetched bytes and cause the <a href="#html-parser">HTML parser</a> to perform the
- appropriate processing of the input stream.</p>
+ runs must then fill the parser's <a href="#the-input-byte-stream">input byte stream</a> with
+ the fetched bytes and cause the <a href="#html-parser">HTML parser</a> to perform
+ the appropriate processing of the input stream.</p>
  <p>The rules for how to convert the bytes of the plain text document
  into actual characters, and the rules for actually rendering the
@@ -58158,13 +58158,13 @@
  <h4 id="overview-of-the-parsing-model"><span class="secno">8.2.1 </span>Overview of the parsing model</h4>
- <p class="overview"><object data="images/parsing-model-overview.svg" height="450" width="345"><img alt="" height="450" src="http://dev.w3.org/html5/spec/images/parsing-model-overview.png" width="345"></object></p>
+ <p class="overview"><object data="images/parsing-model-overview.svg" height="535" width="345"><img alt="" height="450" src="http://dev.w3.org/html5/spec/images/parsing-model-overview.png" width="345"></object></p>
  <p>The input to the HTML parsing process consists of a stream of
- Unicode code points, which is passed through a
- <a href="#tokenization">tokenization</a> stage followed by a <a href="#tree-construction">tree
- construction</a> stage. The output is a <code><a href="#document">Document</a></code>
- object.</p>
+ <a href="#unicode-code-point" title="Unicode code point">Unicode code points</a>, which
+ is passed through a <a href="#tokenization">tokenization</a> stage followed by a
+ <a href="#tree-construction">tree construction</a> stage. The output is a
+ <code><a href="#document">Document</a></code> object.</p>
  <p class="note">Implementations that <a href="#non-scripted">do not support scripting</a>
  do not have to actually create a DOM
@@ -58203,19 +58203,45 @@
  </div><div class="impl">
- <h4 id="the-input-stream"><span class="secno">8.2.2 </span>The <dfn>input stream</dfn></h4>
+ <h4 id="the-input-byte-stream"><span class="secno">8.2.2 </span>The <dfn>input byte stream</dfn></h4>
  <p>The stream of Unicode code points that comprises the input to the
  tokenization stage will be initially seen by the user agent as a
  stream of bytes (typically coming over the network or from the local
  file system). The bytes encode the actual characters according to a
- particular <em>character encoding</em>, which the user agent must
- use to decode the bytes into characters.</p>
+ particular <i>character encoding</i>, which the user agent must use
+ to decode the bytes into characters.</p>
  <p class="note">For XML documents, the algorithm user agents must
  use to determine the character encoding is given by the XML
  specification. This section does not apply to XML documents. <a href="#refsXML">[XML]</a></p>
+ <p>The <a href="#encoding-sniffing-algorithm">encoding sniffing algorithm</a> defined below is
+ used to determine the character encoding.</p>
+
+ <p>Given an encoding, the bytes in the <a href="#the-input-byte-stream">input byte
+ stream</a> must be converted to Unicode code points for the
+ tokenizer's <a href="#input-stream">input stream</a>, as described by the rules for
+ that encoding, except that the leading U+FEFF BYTE ORDER MARK
+ character, if any, must not be stripped by the encoding layer (it is
+ stripped by the rule below).</p>
+
+ <p>Bytes or sequences of bytes in the original byte stream that
+ could not be converted to Unicode code points must be converted to
+ U+FFFD REPLACEMENT CHARACTERs. Specifically, if the encoding is
+ UTF-8, the bytes must be <a href="#decoded-as-utf-8-with-error-handling" title="decoded as UTF-8, with error
+ handling">decoded with the error handling</a> defined in this
+ specification.</p>
+
+ <p class="note">Bytes or sequences of bytes in the original byte
+ stream that did not conform to the encoding specification (e.g.
+ invalid UTF-8 byte sequences in a UTF-8 input byte stream) are
+ errors that conformance checkers are expected to report.</p>
+
+ <p>Any byte or sequence of bytes in the original byte stream that is
+ <a href="#misinterpreted-for-compatibility">misinterpreted for compatibility</a> is a <a href="#parse-error">parse
+ error</a>.</p>
+
  <h5 id="determining-the-character-encoding"><span class="secno">8.2.2.1 </span>Determining the character encoding</h5>
@@ -58460,7 +58486,7 @@
  </ol><p>The <a href="#document-s-character-encoding">document's character encoding</a> must immediately
  be set to the value returned from this algorithm, at the same time
  as the user agent uses the returned value to select the decoder to
- use for the input stream.</p>
+ use for the input byte stream.</p>
  <hr><p>When an algorithm requires a user agent to <dfn id="prescan-a-byte-stream-to-determine-its-encoding">prescan a byte stream to determine its encoding</dfn>,
  given some defined <var title="">end condition</var>, then it must run the
  following steps.
@@ -58470,7 +58496,7 @@
  <ol><li>
  <p>Let <var title="">position</var> be a pointer to a byte in the
- input stream, initially pointing at the first byte. If at any
+ input byte stream, initially pointing at the first byte. If at any
  point during these steps the user agent either runs out of bytes or
  reaches its <var title="">end condition</var>, then abort the
  <a href="#prescan-a-byte-stream-to-determine-its-encoding">prescan a byte stream to determine its encoding</a>
@@ -58605,8 +58631,8 @@
  </dl></li>
  <li><i>Next byte</i>: Move <var title="">position</var> so it
- points at the next byte in the input stream, and return to the step
- above labeld <i>loop</i>.</li>
+ points at the next byte in the input byte stream, and return to the
+ step above labeld <i>loop</i>.</li>
  </ol><p>When the <a href="#prescan-a-byte-stream-to-determine-its-encoding">prescan a byte stream to determine its encoding</a>
  algorithm says to <dfn id="concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">get an attribute</dfn>,
@@ -58871,30 +58897,12 @@
  <h5 id="preprocessing-the-input-stream"><span class="secno">8.2.2.3 </span>Preprocessing the input stream</h5>
- <p>Given an encoding, the bytes in the input stream must be
- converted to Unicode code points for the tokenizer, as described by
- the rules for that encoding, except that the leading U+FEFF BYTE
- ORDER MARK character, if any, must not be stripped by the encoding
- layer (it is stripped by the rule below).</p>
-
- <p>Bytes or sequences of bytes in the original byte stream that
- could not be converted to Unicode code points must be converted to
- U+FFFD REPLACEMENT CHARACTERs. Specifically, if the encoding is
- UTF-8, the bytes must be <a href="#decoded-as-utf-8-with-error-handling" title="decoded as UTF-8, with error
- handling">decoded with the error handling</a> defined in this
- specification.</p>
-
- <p class="note">Bytes or sequences of bytes in the original byte
- stream that did not conform to the encoding specification
- (e.g. invalid UTF-8 byte sequences in a UTF-8 input stream) are
- errors that conformance checkers are expected to report.</p>
-
- <p>Any byte or sequence of bytes in the original byte stream that is
- <a href="#misinterpreted-for-compatibility">misinterpreted for compatibility</a> is a <a href="#parse-error">parse
- error</a>.</p>
+ <p>The <dfn id="input-stream">input stream</dfn> consists of the characters pushed
+ into it as the <a href="#the-input-byte-stream">input byte stream</a> is decoded or from the
+ various APIs that directly manipulate the input stream.</p>
  <p>One leading U+FEFF BYTE ORDER MARK character must be ignored if
- any are present.</p>
+ any are present in the <a href="#input-stream">input stream</a>.</p>
  <p class="note">The requirement to strip a U+FEFF BYTE ORDER MARK
  character regardless of whether that character was used to determine
@@ -58915,18 +58923,18 @@
  undefined Unicode characters (noncharacters).</p>
  <p>U+000D CARRIAGE RETURN (CR) characters and U+000A LINE FEED (LF)
- characters are treated specially. Any CR characters that are
- followed by LF characters must be removed, and any CR characters not
- followed by LF characters must be converted to LF characters. Thus,
- newlines in HTML DOMs are represented by LF characters, and there
- are never any CR characters in the input to the
- <a href="#tokenization">tokenization</a> stage.</p>
+ characters are treated specially. All CR characters must be
+ converted to LF characters, and any LF characters that immediately
+ follow a CR character must be ignored. Thus, newlines in HTML DOMs
+ are represented by LF characters, and there are never any CR
+ characters in the input to the <a href="#tokenization">tokenization</a> stage.</p>
  <p>The <dfn id="next-input-character">next input character</dfn> is the first character in the
- input stream that has not yet been <dfn id="consumed">consumed</dfn>. Initially,
- the <i><a href="#next-input-character">next input character</a></i> is the first character in the
- input. The <dfn id="current-input-character">current input character</dfn> is the last character
- to have been <i><a href="#consumed">consumed</a></i>.</p>
+ <a href="#input-stream">input stream</a> that has not yet been <dfn id="consumed">consumed</dfn>
+ or explicit ignored by the requirements in this section. Initially,
+ the <i><a href="#next-input-character">next input character</a></i> is the first character in the input.
+ The <dfn id="current-input-character">current input character</dfn> is the last character to have
+ been <i><a href="#consumed">consumed</a></i>.</p>
  <p>The <dfn id="insertion-point">insertion point</dfn> is the position (just before a character or
  just before the end of the input stream) where content
@@ -58937,9 +58945,9 @@
  undefined.</p>
  <p>The "EOF" character in the tables below is a conceptual character
- representing the end of the <a href="#the-input-stream">input stream</a>. If the parser
+ representing the end of the <a href="#input-stream">input stream</a>. If the parser
  is a <a href="#script-created-parser">script-created parser</a>, then the end of the
- <a href="#the-input-stream">input stream</a> is reached when an <dfn id="explicit-eof-character">explicit "EOF"
+ <a href="#input-stream">input stream</a> is reached when an <dfn id="explicit-eof-character">explicit "EOF"
  character</dfn> (inserted by the <code title="dom-document-close"><a href="#dom-document-close">document.close()</a></code> method) is consumed.
  Otherwise, the "EOF" character is not a real character in the
  stream, but rather the lack of any further characters.</p>
@@ -65347,7 +65355,7 @@
  </ol><p>When the user agent is to <dfn id="abort-a-parser">abort a parser</dfn>, it must run the
  following steps:</p>
- <ol><li><p>Throw away any pending content in the <a href="#the-input-stream">input
+ <ol><li><p>Throw away any pending content in the <a href="#input-stream">input
  stream</a>, and discard any future content that would have been
  added to it.</li>
@@ -66148,7 +66156,7 @@
  <li>
- <p>Place into the <a href="#the-input-stream">input stream</a> for the <a href="#html-parser">HTML
+ <p>Place into the <a href="#input-stream">input stream</a> for the <a href="#html-parser">HTML
  parser</a> just created the <var title="">input</var>. The
  encoding <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a> is <i>irrelevant</i>.</p>
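
Editorial illustration (not part of the commit): the reworded CR/LF rule in the diff can be read as a one-character-at-a-time state machine, which is why it no longer needs look-ahead. The sketch below is a minimal Python rendering under stated assumptions. The names decode_bytes and preprocess are hypothetical, and Python's errors="replace" policy is only a stand-in for the spec's own "decoded as UTF-8, with error handling" rules, which may substitute U+FFFD differently for some malformed sequences.

```python
from typing import Iterable, Iterator


def decode_bytes(data: bytes, encoding: str = "utf-8") -> str:
    """Decode the input byte stream, mapping undecodable bytes to U+FFFD.

    Assumption: Python's "replace" handler approximates the spec's error
    handling. The plain "utf-8" codec keeps a leading U+FEFF, matching the
    requirement that the encoding layer must not strip it.
    """
    return data.decode(encoding, errors="replace")


def preprocess(chars: Iterable[str]) -> Iterator[str]:
    """Yield input-stream characters with BOM and newline rules applied.

    Only the previous character is remembered, so a CR can be emitted as LF
    immediately without waiting to see what follows it.
    """
    at_start = True
    last_was_cr = False
    for ch in chars:
        if at_start:
            at_start = False
            if ch == "\ufeff":
                continue  # ignore one leading BYTE ORDER MARK
        if ch == "\r":
            last_was_cr = True
            yield "\n"  # every CR is converted to LF
        elif ch == "\n" and last_was_cr:
            last_was_cr = False  # LF immediately after CR is ignored
        else:
            last_was_cr = False
            yield ch


if __name__ == "__main__":
    text = decode_bytes(b"\xef\xbb\xbfline one\r\nline two\rline three\n")
    print(repr("".join(preprocess(text))))
```

Because the function consumes any iterable, characters can be fed as they arrive from the network; a CR at the end of one chunk and an LF at the start of the next are still collapsed into a single LF.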
Received on Monday, 13 February 2012 22:48:26 UTC