- From: Ian Hickson via cvs-syncmail <cvsmail@w3.org>
- Date: Mon, 13 Feb 2012 21:07:10 +0000
- To: public-html-commits@w3.org
Update of /sources/public/html5/spec In directory hutz:/tmp/cvs-serv2558 Modified Files: Overview.html Log Message: Factor out the prescan algorithm for reuse in other specs. (whatwg r6990) Index: Overview.html =================================================================== RCS file: /sources/public/html5/spec/Overview.html,v retrieving revision 1.5582 retrieving revision 1.5583 diff -u -d -r1.5582 -r1.5583 --- Overview.html 11 Feb 2012 18:43:03 -0000 1.5582 +++ Overview.html 13 Feb 2012 21:07:05 -0000 1.5583 @@ -320,7 +320,7 @@ <h1>HTML5</h1> <h2 class="no-num no-toc" id="a-vocabulary-and-associated-apis-for-html-and-xhtml">A vocabulary and associated APIs for HTML and XHTML</h2> - <h2 class="no-num no-toc" id="editor-s-draft-11-february-2012">Editor's Draft 11 February 2012</h2> + <h2 class="no-num no-toc" id="editor-s-draft-13-february-2012">Editor's Draft 13 February 2012</h2> <dl><dt>Latest Published Version:</dt> <dd><a href="http://www.w3.org/TR/html5/">http://www.w3.org/TR/html5/</a></dd> <dt>Latest Editor's Draft:</dt> @@ -467,7 +467,7 @@ Group</a> is the W3C working group responsible for this specification's progress along the W3C Recommendation track. - This specification is the 11 February 2012 Editor's Draft. + This specification is the 13 February 2012 Editor's Draft. </p><!-- UNDER NO CIRCUMSTANCES IS THE PRECEDING PARAGRAPH TO BE REMOVED OR EDITED WITHOUT TALKING TO IAN FIRST --><p>Work on this specification is also done at the <a href="http://www.whatwg.org/">WHATWG</a>. 
The W3C HTML working group actively pursues convergence with the WHATWG, as required by the <a href="http://www.w3.org/2007/03/HTML-WG-charter">W3C HTML working group charter</a>.</p><!-- UNDER NO CIRCUMSTANCES IS THE FOLLOWING PARAGRAPH TO BE REMOVED OR EDITED WITHOUT TALKING TO IAN FIRST --><p>This document was produced by a group operating under the <a href="http://www.w3.org/Consortium/Patent-Policy-20040205/">5 @@ -58232,10 +58232,10 @@ parse of the document with the real encoding.</p> <p id="documentEncoding">User agents must use the following - algorithm (the <dfn id="encoding-sniffing-algorithm">encoding sniffing algorithm</dfn>) to determine - the character encoding to use when decoding a document in the first - pass. This algorithm takes as input any out-of-band metadata - available to the user agent (e.g. the <a href="#content-type" title="Content-Type">Content-Type metadata</a> of the document) + algorithm, called the <dfn id="encoding-sniffing-algorithm">encoding sniffing algorithm</dfn>, to + determine the character encoding to use when decoding a document in + the first pass. This algorithm takes as input any out-of-band + metadata available to the user agent (e.g. the <a href="#content-type" title="Content-Type">Content-Type metadata</a> of the document) and all the bytes available so far, and returns an encoding and a <dfn id="concept-encoding-confidence" title="concept-encoding-confidence">confidence</dfn>. The confidence is either <i>tentative</i>, <i>certain</i>, or @@ -58271,9 +58271,9 @@ <p class="note">The authoring conformance requirements for character encoding declarations limit them to only appearing <a href="#charset1024">in the first 1024 bytes</a>. 
User agents are - therefore encouraged to use the preparse algorithm below (part of - these steps) on the first 1024 bytes, but not to stall beyond - that.</p> + therefore encouraged to use the prescan algorithm below (as + invoked by these steps) on the first 1024 bytes, but not to stall + beyond that.</p> </li> @@ -58298,315 +58298,28 @@ </table><p class="note">This step looks for Unicode Byte Order Marks (BOMs).</li> - <li><p>Otherwise, the user agent will have to search for explicit - character encoding information in the file itself. This should - proceed as follows: - - <p>Let <var title="">position</var> be a pointer to a byte in the - input stream, initially pointing at the first byte. If at any - point during these substeps the user agent either runs out of - bytes or decides that scanning further bytes would not be - efficient, then skip to the next step of the overall character - encoding detection algorithm. User agents may decide that scanning - <em>any</em> bytes is not efficient, in which case these substeps - are entirely skipped.</p> - - <p>Now, repeat the following "two" steps until the algorithm - aborts (either because user agent aborts, as described above, or - because a character encoding is found):</p> - - <ol><li><p>If <var title="">position</var> points to:</p> - - <dl class="switch"><dt>A sequence of bytes starting with: 0x3C 0x21 0x2D 0x2D (ASCII '<!--')</dt> - <dd> - - <p>Advance the <var title="">position</var> pointer so that it - points at the first 0x3E byte which is preceded by two 0x2D - bytes (i.e. at the end of an ASCII '-->' sequence) and comes - after the 0x3C byte that was found. 
(The two 0x2D bytes can be - the same as the those in the '<!--' sequence.)</p> - - </dd> - - <dt>A sequence of bytes starting with: 0x3C, 0x4D or 0x6D, 0x45 or 0x65, 0x54 or 0x74, 0x41 or 0x61, and one of 0x09, 0x0A, 0x0C, 0x0D, 0x20, 0x2F (case-insensitive ASCII '<meta' followed by a space or slash)</dt> - <dd> - - <ol><li><p>Advance the <var title="">position</var> pointer so - that it points at the next 0x09, 0x0A, 0x0C, 0x0D, 0x20, or - 0x2F byte (the one in sequence of characters matched - above).</li> - - <li><p>Let <var title="">attribute list</var> be an empty - list of strings.</li> - - <li><p>Let <var title="">got pragma</var> be false.</li> - - <li><p>Let <var title="">need pragma</var> be null.</li> - - <li><p>Let <var title="">charset</var> be the null value - (which, for the purposes of this algorithm, is distinct from - an unrecognised encoding or the empty string).</li> - - <li><p><i>Attributes</i>: <a href="#concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">Get an - attribute</a> and its value. If no attribute was sniffed, - then jump to the <i>processing</i> step below.</li> - - <li><p>If the attribute's name is already in <var title="">attribute list</var>, then return to the step - labeled <i>attributes</i>.</p> - - <li><p>Add the attribute's name to <var title="">attribute - list</var>.</p> - - <li> - - <p>Run the appropriate step from the following list, if one - applies:</p> - - <dl class="switch"><dt>If the attribute's name is "<code title="">http-equiv</code>"</dt> - - <dd><p>If the attribute's value is "<code title="">content-type</code>", then set <var title="">got - pragma</var> to true.</dd> - - <dt>If the attribute's name is "<code title="">content</code>"</dt> - - <dd><p>Apply the <a href="#algorithm-for-extracting-an-encoding-from-a-meta-element">algorithm for extracting an encoding - from a <code>meta</code> element</a>, giving the - attribute's value as the string to parse. 
If an encoding is - returned, and if <var title="">charset</var> is still set - to null, let <var title="">charset</var> be the encoding - returned, and set <var title="">need pragma</var> to - true.</dd> - - <dt>If the attribute's name is "<code title="">charset</code>"</dt> - - <dd><p>Let <var title="">charset</var> be the encoding - corresponding to the attribute's value, and set <var title="">need pragma</var> to false.</dd> - - </dl></li> - - <li><p>Return to the step labeled <i>attributes</i>.</li> - - <li><p><i>Processing</i>: If <var title="">need pragma</var> - is null, then jump to the second step of the overall "two - step" algorithm.</li> - - <li><p>If <var title="">need pragma</var> is true but <var title="">got pragma</var> is false, then jump to the second - step of the overall "two step" algorithm.</li> - - <li><p>If <var title="">charset</var> is <a href="#a-utf-16-encoding">a UTF-16 - encoding</a>, change the value of <var title="">charset</var> to UTF-8.</li> - - <li><p>If <var title="">charset</var> is not a supported - character encoding, then jump to the second step of the - overall "two step" algorithm.</li> - - <li><p>Return the encoding given by <var title="">charset</var>, with <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a> - <i>tentative</i>, and abort all these steps.</li> - - </ol></dd> - - <dt>A sequence of bytes starting with a 0x3C byte (ASCII <), optionally a 0x2F byte (ASCII /), and finally a byte in the range 0x41-0x5A or 0x61-0x7A (an ASCII letter)</dt> - <dd> - - <ol><li><p>Advance the <var title="">position</var> pointer so - that it points at the next 0x09 (ASCII TAB), 0x0A (ASCII LF), - 0x0C (ASCII FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x3E - (ASCII >) byte.</li> - - <li><p>Repeatedly <a href="#concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">get an - attribute</a> until no further attributes can be found, - then jump to the second step in the 
overall "two step" - algorithm.</li> - - </ol></dd> - - <dt>A sequence of bytes starting with: 0x3C 0x21 (ASCII '<!')</dt> - <dt>A sequence of bytes starting with: 0x3C 0x2F (ASCII '</')</dt> - <dt>A sequence of bytes starting with: 0x3C 0x3F (ASCII '<?')</dt> - <dd> - - <p>Advance the <var title="">position</var> pointer so that it - points at the first 0x3E byte (ASCII >) that comes after the - 0x3C byte that was found.</p> - - </dd> - - <dt>Any other byte</dt> - <dd> - - <p>Do nothing with that byte.</p> - - </dd> - - </dl></li> - - <li>Move <var title="">position</var> so it points at the next - byte in the input stream, and return to the first step of this - "two step" algorithm.</li> - - </ol><p>When the above "two step" algorithm says to <dfn id="concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">get an - attribute</dfn>, it means doing this:</p> - - <ol><li><p>If the byte at <var title="">position</var> is one of 0x09 - (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII FF), 0x0D (ASCII CR), - 0x20 (ASCII space), or 0x2F (ASCII /) then advance <var title="">position</var> to the next byte and redo this - substep.</li> - - <li><p>If the byte at <var title="">position</var> is 0x3E (ASCII - >), then abort the "get an attribute" algorithm. There isn't - one.</li> - - <li><p>Otherwise, the byte at <var title="">position</var> is the - start of the attribute name. 
Let <var title="">attribute - name</var> and <var title="">attribute value</var> be the empty - string.</li> - - <li><p><i>Attribute name</i>: Process the byte at <var title="">position</var> as follows:</p> - - <dl class="switch"><dt>If it is 0x3D (ASCII =), and the <var title="">attribute - name</var> is longer than the empty string</dt> - - <dd>Advance <var title="">position</var> to the next byte and - jump to the step below labeled <i>value</i>.</dd> - - <dt>If it is 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII - FF), 0x0D (ASCII CR), or 0x20 (ASCII space)</dt> - - <dd>Jump to the step below labeled <i>spaces</i>.</dd> - - <dt>If it is 0x2F (ASCII /) or 0x3E (ASCII >)</dt> - - <dd>Abort the "get an attribute" algorithm. The attribute's - name is the value of <var title="">attribute name</var>, its - value is the empty string.</dd> - - <dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII - Z)</dt> - - <dd>Append the Unicode character with code point <span title=""><var title="">b</var>+0x20</span> to <var title="">attribute name</var> (where <var title="">b</var> is - the value of the byte at <var title="">position</var>). (This - converts the input to lowercase.)</dd> - - <dt>Anything else</dt> - - <dd>Append the Unicode character with the same code point as the - value of the byte at <var title="">position</var>) to <var title="">attribute name</var>. 
(It doesn't actually matter how - bytes outside the ASCII range are handled here, since only - ASCII characters can contribute to the detection of a character - encoding.)</dd> - - </dl></li> - - <li><p>Advance <var title="">position</var> to the next byte and - return to the previous step.</li> - - <li><p><i>Spaces</i>: If the byte at <var title="">position</var> is one of 0x09 (ASCII TAB), 0x0A (ASCII - LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space) then - advance <var title="">position</var> to the next byte, then, - repeat this step.</li> - - <li><p>If the byte at <var title="">position</var> is - <em>not</em> 0x3D (ASCII =), abort the "get an attribute" - algorithm. The attribute's name is the value of <var title="">attribute name</var>, its value is the empty - string.</li> - - <li><p>Advance <var title="">position</var> past the 0x3D (ASCII - =) byte.</li> - - <li><p><i>Value</i>: If the byte at <var title="">position</var> is one of 0x09 (ASCII TAB), 0x0A (ASCII - LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space) then - advance <var title="">position</var> to the next byte, then, - repeat this step.</li> - - <li><p>Process the byte at <var title="">position</var> as - follows:</p> - - <dl class="switch"><dt>If it is 0x22 (ASCII ") or 0x27 (ASCII ')</dt> - - <dd> - - <ol><li>Let <var title="">b</var> be the value of the byte at - <var title="">position</var>.</li> - - <li>Advance <var title="">position</var> to the next - byte.</li> - - <li>If the value of the byte at <var title="">position</var> - is the value of <var title="">b</var>, then advance <var title="">position</var> to the next byte and abort the "get - an attribute" algorithm. 
The attribute's name is the value of - <var title="">attribute name</var>, and its value is the - value of <var title="">attribute value</var>.</li> - - <li>Otherwise, if the value of the byte at <var title="">position</var> is in the range 0x41 (ASCII A) to - 0x5A (ASCII Z), then append a Unicode character to <var title="">attribute value</var> whose code point is 0x20 more - than the value of the byte at <var title="">position</var>.</li> - - <li>Otherwise, append a Unicode character to <var title="">attribute value</var> whose code point is the same as - the value of the byte at <var title="">position</var>.</li> - - <li>Return to the second step in these substeps.</li> - - </ol></dd> - - <dt>If it is 0x3E (ASCII >)</dt> - - <dd>Abort the "get an attribute" algorithm. The attribute's - name is the value of <var title="">attribute name</var>, its - value is the empty string.</dd> - - - <dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII - Z)</dt> - - <dd>Append the Unicode character with code point <span title=""><var title="">b</var>+0x20</span> to <var title="">attribute - value</var> (where <var title="">b</var> is the value of the - byte at <var title="">position</var>). Advance <var title="">position</var> to the next byte.</dd> - - <dt>Anything else</dt> - - <dd>Append the Unicode character with the same code point as the - value of the byte at <var title="">position</var>) to <var title="">attribute value</var>. Advance <var title="">position</var> to the next byte.</dd> - - </dl></li> - - <li><p>Process the byte at <var title="">position</var> as - follows:</p> - - <dl class="switch"><dt>If it is 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII - FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x3E (ASCII - >)</dt> - - <dd>Abort the "get an attribute" algorithm. 
The attribute's - name is the value of <var title="">attribute name</var> and its - value is the value of <var title="">attribute value</var>.</dd> - - <dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII - Z)</dt> - - <dd>Append the Unicode character with code point <span title=""><var title="">b</var>+0x20</span> to <var title="">attribute - value</var> (where <var title="">b</var> is the value of the - byte at <var title="">position</var>).</dd> - - <dt>Anything else</dt> - - <dd>Append the Unicode character with the same code point as the - value of the byte at <var title="">position</var>) to <var title="">attribute value</var>.</dd> - - </dl></li> + <li> - <li><p>Advance <var title="">position</var> to the next byte and - return to the previous step.</li> + <p>Otherwise, optionally <a href="#prescan-a-byte-stream-to-determine-its-encoding" title="prescan a byte stream to + determine its encoding">prescan the byte stream to determine its + encoding</a>. The <var title="">end condition</var> is that the + user agent decides that scanning further bytes would not be + efficient. User agents are encouraged to only prescan the first + 1024 bytes. User agents may decide that scanning <em>any</em> + bytes is not efficient, in which case these substeps are entirely + skipped.</p> - </ol><p>For the sake of interoperability, user agents should not use a - pre-scan algorithm that returns different results than the one - described above. (But, if you do, please at least let us know, so - that we can improve this algorithm and benefit everyone...)</p> + <p>The aforementioned algorithm either aborts unsuccessfully or + returns a character encoding. If it returns a character encoding, + then this algorithm must be aborted, returning the same encoding, + with <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a> + <i>tentative</i>.</p> </li> - <li><p>If the user agent has information on the likely encoding for - this page, e.g. 
based on the encoding of the page when it was last - visited, then return that encoding, with the <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a> + <li><p>Otherwise, if the user agent has information on the likely + encoding for this page, e.g. based on the encoding of the page when + it was last visited, then return that encoding, with the <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a> <i>tentative</i>, and abort these steps.</li> <li> @@ -58749,6 +58462,314 @@ as the user agent uses the returned value to select the decoder to use for the input stream.</p> + <hr><p>When an algorithm requires a user agent to <dfn id="prescan-a-byte-stream-to-determine-its-encoding">prescan a byte + stream to determine its encoding</dfn>, given some defined <var title="">end condition</var>, then it must run the following steps. + These steps either abort unsuccessfully or return a character + encoding.</p> + + <ol><li> + + <p>Let <var title="">position</var> be a pointer to a byte in the + input stream, initially pointing at the first byte. If at any + point during these steps the user agent either runs out of bytes + or reaches its <var title="">end condition</var>, then abort the + <a href="#prescan-a-byte-stream-to-determine-its-encoding">prescan a byte stream to determine its encoding</a> + algorithm unsuccessfully.</p> + + </li> + + <li> + + <p><i>Loop</i>: If <var title="">position</var> points to:</p> + + <dl class="switch"><dt>A sequence of bytes starting with: 0x3C 0x21 0x2D 0x2D (ASCII '<!--')</dt> + <dd> + + <p>Advance the <var title="">position</var> pointer so that it + points at the first 0x3E byte which is preceded by two 0x2D + bytes (i.e. at the end of an ASCII '-->' sequence) and comes + after the 0x3C byte that was found. 
(The two 0x2D bytes can be + the same as the those in the '<!--' sequence.)</p> + + </dd> + + <dt>A sequence of bytes starting with: 0x3C, 0x4D or 0x6D, 0x45 or 0x65, 0x54 or 0x74, 0x41 or 0x61, and one of 0x09, 0x0A, 0x0C, 0x0D, 0x20, 0x2F (case-insensitive ASCII '<meta' followed by a space or slash)</dt> + <dd> + + <ol><li><p>Advance the <var title="">position</var> pointer so + that it points at the next 0x09, 0x0A, 0x0C, 0x0D, 0x20, or + 0x2F byte (the one in sequence of characters matched + above).</li> + + <li><p>Let <var title="">attribute list</var> be an empty + list of strings.</li> + + <li><p>Let <var title="">got pragma</var> be false.</li> + + <li><p>Let <var title="">need pragma</var> be null.</li> + + <li><p>Let <var title="">charset</var> be the null value + (which, for the purposes of this algorithm, is distinct from + an unrecognised encoding or the empty string).</li> + + <li><p><i>Attributes</i>: <a href="#concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">Get an + attribute</a> and its value. If no attribute was sniffed, + then jump to the <i>processing</i> step below.</li> + + <li><p>If the attribute's name is already in <var title="">attribute list</var>, then return to the step + labeled <i>attributes</i>.</p> + + <li><p>Add the attribute's name to <var title="">attribute + list</var>.</p> + + <li> + + <p>Run the appropriate step from the following list, if one + applies:</p> + + <dl class="switch"><dt>If the attribute's name is "<code title="">http-equiv</code>"</dt> + + <dd><p>If the attribute's value is "<code title="">content-type</code>", then set <var title="">got + pragma</var> to true.</dd> + + <dt>If the attribute's name is "<code title="">content</code>"</dt> + + <dd><p>Apply the <a href="#algorithm-for-extracting-an-encoding-from-a-meta-element">algorithm for extracting an encoding + from a <code>meta</code> element</a>, giving the + attribute's value as the string to parse. 
If an encoding is + returned, and if <var title="">charset</var> is still set + to null, let <var title="">charset</var> be the encoding + returned, and set <var title="">need pragma</var> to + true.</dd> + + <dt>If the attribute's name is "<code title="">charset</code>"</dt> + + <dd><p>Let <var title="">charset</var> be the encoding + corresponding to the attribute's value, and set <var title="">need pragma</var> to false.</dd> + + </dl></li> + + <li><p>Return to the step labeled <i>attributes</i>.</li> + + <li><p><i>Processing</i>: If <var title="">need pragma</var> is + null, then jump to the step below labeled <i>next + byte</i>.</li> + + <li><p>If <var title="">need pragma</var> is true but <var title="">got pragma</var> is false, then jump to the step below + labeled <i>next byte</i>.</li> + + <li><p>If <var title="">charset</var> is <a href="#a-utf-16-encoding">a UTF-16 + encoding</a>, change the value of <var title="">charset</var> to UTF-8.</li> + + <li><p>If <var title="">charset</var> is not a supported + character encoding, then jump to the step below labeled <i>next + byte</i>.</li> + + <li><p>Abort the <a href="#prescan-a-byte-stream-to-determine-its-encoding">prescan a byte stream to determine its + encoding</a> algorithm, returning the encoding given by <var title="">charset</var>.</li> + + </ol></dd> + + <dt>A sequence of bytes starting with a 0x3C byte (ASCII <), optionally a 0x2F byte (ASCII /), and finally a byte in the range 0x41-0x5A or 0x61-0x7A (an ASCII letter)</dt> + <dd> + + <ol><li><p>Advance the <var title="">position</var> pointer so + that it points at the next 0x09 (ASCII TAB), 0x0A (ASCII LF), + 0x0C (ASCII FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x3E + (ASCII >) byte.</li> + + <li><p>Repeatedly <a href="#concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">get an + attribute</a> until no further attributes can be found, then + jump to the step below labeled <i>next byte</i>.</li> + + </ol></dd> + 
+ <dt>A sequence of bytes starting with: 0x3C 0x21 (ASCII '<!')</dt> + <dt>A sequence of bytes starting with: 0x3C 0x2F (ASCII '</')</dt> + <dt>A sequence of bytes starting with: 0x3C 0x3F (ASCII '<?')</dt> + <dd> + + <p>Advance the <var title="">position</var> pointer so that it + points at the first 0x3E byte (ASCII >) that comes after the + 0x3C byte that was found.</p> + + </dd> + + <dt>Any other byte</dt> + <dd> + + <p>Do nothing with that byte.</p> + + </dd> + + </dl></li> + + <li><i>Next byte</i>: Move <var title="">position</var> so it + points at the next byte in the input stream, and return to the step + above labeld <i>loop</i>.</li> + + </ol><p>When the <a href="#prescan-a-byte-stream-to-determine-its-encoding">prescan a byte stream to determine its + encoding</a> algorithm says to <dfn id="concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">get an attribute</dfn>, + it means doing this:</p> + + <ol><li><p>If the byte at <var title="">position</var> is one of 0x09 + (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII FF), 0x0D (ASCII CR), + 0x20 (ASCII space), or 0x2F (ASCII /) then advance <var title="">position</var> to the next byte and redo this + step.</li> + + <li><p>If the byte at <var title="">position</var> is 0x3E (ASCII + >), then abort the <a href="#concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">get an + attribute</a> algorithm. There isn't one.</li> + + <li><p>Otherwise, the byte at <var title="">position</var> is the + start of the attribute name. 
Let <var title="">attribute name</var> + and <var title="">attribute value</var> be the empty + string.</li> + + <li><p><i>Attribute name</i>: Process the byte at <var title="">position</var> as follows:</p> + + <dl class="switch"><dt>If it is 0x3D (ASCII =), and the <var title="">attribute + name</var> is longer than the empty string</dt> + + <dd>Advance <var title="">position</var> to the next byte and + jump to the step below labeled <i>value</i>.</dd> + + <dt>If it is 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII + FF), 0x0D (ASCII CR), or 0x20 (ASCII space)</dt> + + <dd>Jump to the step below labeled <i>spaces</i>.</dd> + + <dt>If it is 0x2F (ASCII /) or 0x3E (ASCII >)</dt> + + <dd>Abort the <a href="#concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">get an + attribute</a> algorithm. The attribute's name is the value of + <var title="">attribute name</var>, its value is the empty + string.</dd> + + <dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII + Z)</dt> + + <dd>Append the Unicode character with code point <span title=""><var title="">b</var>+0x20</span> to <var title="">attribute name</var> (where <var title="">b</var> is + the value of the byte at <var title="">position</var>). (This + converts the input to lowercase.)</dd> + + <dt>Anything else</dt> + + <dd>Append the Unicode character with the same code point as the + value of the byte at <var title="">position</var>) to <var title="">attribute name</var>. 
(It doesn't actually matter how + bytes outside the ASCII range are handled here, since only + ASCII characters can contribute to the detection of a character + encoding.)</dd> + + </dl></li> + + <li><p>Advance <var title="">position</var> to the next byte and + return to the previous step.</li> + + <li><p><i>Spaces</i>: If the byte at <var title="">position</var> is one of 0x09 (ASCII TAB), 0x0A (ASCII + LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space) then + advance <var title="">position</var> to the next byte, then, + repeat this step.</li> + + <li><p>If the byte at <var title="">position</var> is <em>not</em> + 0x3D (ASCII =), abort the <a href="#concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">get an + attribute</a> algorithm. The attribute's name is the value of + <var title="">attribute name</var>, its value is the empty + string.</li> + + <li><p>Advance <var title="">position</var> past the 0x3D (ASCII + =) byte.</li> + + <li><p><i>Value</i>: If the byte at <var title="">position</var> is one of 0x09 (ASCII TAB), 0x0A (ASCII + LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space) then + advance <var title="">position</var> to the next byte, then, + repeat this step.</li> + + <li><p>Process the byte at <var title="">position</var> as + follows:</p> + + <dl class="switch"><dt>If it is 0x22 (ASCII ") or 0x27 (ASCII ')</dt> + + <dd> + + <ol><li>Let <var title="">b</var> be the value of the byte at + <var title="">position</var>.</li> + + <li><i>Quote loop</i>: Advance <var title="">position</var> to + the next byte.</li> + + <li>If the value of the byte at <var title="">position</var> is + the value of <var title="">b</var>, then advance <var title="">position</var> to the next byte and abort the "get an + attribute" algorithm. 
The attribute's name is the value of <var title="">attribute name</var>, and its value is the value of + <var title="">attribute value</var>.</li> + + <li>Otherwise, if the value of the byte at <var title="">position</var> is in the range 0x41 (ASCII A) to 0x5A + (ASCII Z), then append a Unicode character to <var title="">attribute value</var> whose code point is 0x20 more + than the value of the byte at <var title="">position</var>.</li> + + <li>Otherwise, append a Unicode character to <var title="">attribute value</var> whose code point is the same as + the value of the byte at <var title="">position</var>.</li> + + <li>Return to the step above labeled <i>quote loop</i>.</li> + + </ol></dd> + + <dt>If it is 0x3E (ASCII >)</dt> + + <dd>Abort the <a href="#concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">get an + attribute</a> algorithm. The attribute's name is the value of + <var title="">attribute name</var>, its value is the empty + string.</dd> + + + <dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII + Z)</dt> + + <dd>Append the Unicode character with code point <span title=""><var title="">b</var>+0x20</span> to <var title="">attribute value</var> (where <var title="">b</var> is + the value of the byte at <var title="">position</var>). Advance + <var title="">position</var> to the next byte.</dd> + + <dt>Anything else</dt> + + <dd>Append the Unicode character with the same code point as the + value of the byte at <var title="">position</var>) to <var title="">attribute value</var>. 
Advance <var title="">position</var> to the next byte.</dd> + + </dl></li> + + <li><p>Process the byte at <var title="">position</var> as + follows:</p> + + <dl class="switch"><dt>If it is 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII + FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x3E (ASCII + >)</dt> + + <dd>Abort the <a href="#concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">get an + attribute</a> algorithm. The attribute's name is the value of + <var title="">attribute name</var> and its value is the value of + <var title="">attribute value</var>.</dd> + + <dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII Z)</dt> + + <dd>Append the Unicode character with code point <span title=""><var title="">b</var>+0x20</span> to <var title="">attribute value</var> (where <var title="">b</var> is + the value of the byte at <var title="">position</var>).</dd> + + <dt>Anything else</dt> + + <dd>Append the Unicode character with the same code point as the + value of the byte at <var title="">position</var>) to <var title="">attribute value</var>.</dd> + + </dl></li> + + <li><p>Advance <var title="">position</var> to the next byte and + return to the previous step.</li> + + </ol><p>For the sake of interoperability, user agents should not use a + pre-scan algorithm that returns different results than the one + described above. (But, if you do, please at least let us know, so + that we can improve this algorithm and benefit everyone...)</p> + + + <h5 id="character-encodings-0"><span class="secno">8.2.2.2 </span>Character encodings</h5>
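Illustrative note, not part of the commit: the "get an attribute" steps added in the hunk above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the specification's reference implementation. The names get_attribute, EndOfInput and WHITESPACE are hypothetical; position is modelled as an integer index into a bytes object rather than a pointer; and running out of bytes is signalled with an exception so that a caller could abort the prescan unsuccessfully.

```python
# Sketch of the "get an attribute" steps. All names here are hypothetical.
WHITESPACE = frozenset(b"\t\n\x0c\r ")  # 0x09, 0x0A, 0x0C, 0x0D, 0x20


class EndOfInput(Exception):
    """Hypothetical signal: the bytes ran out, so the prescan would abort."""


def get_attribute(data, pos):
    """Return ((name, value), new_pos), or (None, pos) if there is no attribute."""

    def byte_at(i):
        if i >= len(data):
            raise EndOfInput
        return data[i]

    def lower(b):
        # Bytes 0x41-0x5A (ASCII A-Z) are lowercased by adding 0x20.
        return chr(b + 0x20) if 0x41 <= b <= 0x5A else chr(b)

    # Skip leading whitespace and 0x2F (/) bytes.
    while byte_at(pos) in WHITESPACE or byte_at(pos) == 0x2F:
        pos += 1
    if byte_at(pos) == 0x3E:  # '>': there isn't an attribute
        return None, pos

    name, value = "", ""
    # "Attribute name" step.
    while True:
        b = byte_at(pos)
        if b == 0x3D and name:  # '=' after a non-empty name
            pos += 1
            break
        if b in WHITESPACE:
            # "Spaces" step: skip whitespace, then require '='.
            while byte_at(pos) in WHITESPACE:
                pos += 1
            if byte_at(pos) != 0x3D:
                return (name, ""), pos
            pos += 1
            break
        if b in (0x2F, 0x3E):  # '/' or '>': name with an empty value
            return (name, ""), pos
        name += lower(b)
        pos += 1

    # "Value" step: skip whitespace before the value.
    while byte_at(pos) in WHITESPACE:
        pos += 1
    b = byte_at(pos)
    if b in (0x22, 0x27):  # quoted value
        quote = b
        while True:  # "Quote loop"
            pos += 1
            b = byte_at(pos)
            if b == quote:
                return (name, value), pos + 1
            value += lower(b)
    if b == 0x3E:
        return (name, ""), pos
    # Unquoted value: runs until whitespace or '>'.
    value += lower(b)
    pos += 1
    while True:
        b = byte_at(pos)
        if b in WHITESPACE or b == 0x3E:
            return (name, value), pos
        value += lower(b)
        pos += 1
```

A caller would invoke get_attribute repeatedly from the position just after a matched tag name, stopping when it returns None, and would treat EndOfInput as the "runs out of bytes" case that aborts the prescan.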
Received on Monday, 13 February 2012 21:07:13 UTC