hixie: Factor out the prescan algorithm for reuse in other specs. (whatwg r6990) from poot on 2012-02-13 (public-html-diffs@w3.org from February 2012)

From: poot <cvsmail@w3.org>
Date: Mon, 13 Feb 2012 16:07:21 -0500
To: public-html-diffs@w3.org
Message-Id: <E1Rx37J-0005Vb-JF@jay.w3.org>
hixie: Factor out the prescan algorithm for reuse in other specs.
(whatwg r6990)

http://dev.w3.org/cvsweb/html5/spec/Overview.html?r1=1.5582&r2=1.5583&f=h
http://html5.org/tools/web-apps-tracker?from=6989&to=6990

===================================================================
RCS file: /sources/public/html5/spec/Overview.html,v
retrieving revision 1.5582
retrieving revision 1.5583
diff -u -d -r1.5582 -r1.5583
--- Overview.html 11 Feb 2012 18:43:03 -0000 1.5582
+++ Overview.html 13 Feb 2012 21:07:05 -0000 1.5583
@@ -320,7 +320,7 @@
 
    <h1>HTML5</h1>
    <h2 class="no-num no-toc" id="a-vocabulary-and-associated-apis-for-html-and-xhtml">A vocabulary and associated APIs for HTML and XHTML</h2>
-   <h2 class="no-num no-toc" id="editor-s-draft-11-february-2012">Editor's Draft 11 February 2012</h2>
+   <h2 class="no-num no-toc" id="editor-s-draft-13-february-2012">Editor's Draft 13 February 2012</h2>
    <dl><dt>Latest Published Version:</dt>
     <dd><a href="http://www.w3.org/TR/html5/">http://www.w3.org/TR/html5/</a></dd>
     <dt>Latest Editor's Draft:</dt>
@@ -467,7 +467,7 @@
   Group</a> is the W3C working group responsible for this
   specification's progress along the W3C Recommendation
   track.
-  This specification is the 11 February 2012 Editor's Draft.
+  This specification is the 13 February 2012 Editor's Draft.
   </p><!-- UNDER NO CIRCUMSTANCES IS THE PRECEDING PARAGRAPH TO BE REMOVED OR EDITED WITHOUT TALKING TO IAN FIRST --><p>Work on this specification is also done at the <a href="http://www.whatwg.org/">WHATWG</a>. The W3C HTML working group
   actively pursues convergence with the WHATWG, as required by the <a href="http://www.w3.org/2007/03/HTML-WG-charter">W3C HTML working
   group charter</a>.</p><!-- UNDER NO CIRCUMSTANCES IS THE FOLLOWING PARAGRAPH TO BE REMOVED OR EDITED WITHOUT TALKING TO IAN FIRST --><p>This document was produced by a group operating under the <a href="http://www.w3.org/Consortium/Patent-Policy-20040205/">5
@@ -58232,10 +58232,10 @@
   parse of the document with the real encoding.</p>
 
   <p id="documentEncoding">User agents must use the following
-  algorithm (the <dfn id="encoding-sniffing-algorithm">encoding sniffing algorithm</dfn>) to determine
-  the character encoding to use when decoding a document in the first
-  pass. This algorithm takes as input any out-of-band metadata
-  available to the user agent (e.g. the <a href="#content-type" title="Content-Type">Content-Type metadata</a> of the document)
+  algorithm, called the <dfn id="encoding-sniffing-algorithm">encoding sniffing algorithm</dfn>, to
+  determine the character encoding to use when decoding a document in
+  the first pass. This algorithm takes as input any out-of-band
+  metadata available to the user agent (e.g. the <a href="#content-type" title="Content-Type">Content-Type metadata</a> of the document)
   and all the bytes available so far, and returns an encoding and a
   <dfn id="concept-encoding-confidence" title="concept-encoding-confidence">confidence</dfn>. The
   confidence is either <i>tentative</i>, <i>certain</i>, or
@@ -58271,9 +58271,9 @@
 
     <p class="note">The authoring conformance requirements for
     character encoding declarations limit them to only appearing <a href="#charset1024">in the first 1024 bytes</a>. User agents are
-    therefore encouraged to use the preparse algorithm below (part of
-    these steps) on the first 1024 bytes, but not to stall beyond
-    that.</p>
+    therefore encouraged to use the prescan algorithm below (as
+    invoked by these steps) on the first 1024 bytes, but not to stall
+    beyond that.</p>
 
    </li>
 
@@ -58298,315 +58298,28 @@
     </table><p class="note">This step looks for Unicode Byte Order Marks
    (BOMs).</li>
 
-   <li><p>Otherwise, the user agent will have to search for explicit
-   character encoding information in the file itself. This should
-   proceed as follows:
-
-    <p>Let <var title="">position</var> be a pointer to a byte in the
-    input stream, initially pointing at the first byte. If at any
-    point during these substeps the user agent either runs out of
-    bytes or decides that scanning further bytes would not be
-    efficient, then skip to the next step of the overall character
-    encoding detection algorithm. User agents may decide that scanning
-    <em>any</em> bytes is not efficient, in which case these substeps
-    are entirely skipped.</p>
-
-    <p>Now, repeat the following "two" steps until the algorithm
-    aborts (either because user agent aborts, as described above, or
-    because a character encoding is found):</p>
-
-    <ol><li><p>If <var title="">position</var> points to:</p>
-
-      <dl class="switch"><dt>A sequence of bytes starting with: 0x3C 0x21 0x2D 0x2D (ASCII '&lt;!--')</dt>
-       <dd>
-
-        <p>Advance the <var title="">position</var> pointer so that it
-        points at the first 0x3E byte which is preceded by two 0x2D
-        bytes (i.e. at the end of an ASCII '--&gt;' sequence) and comes
-        after the 0x3C byte that was found. (The two 0x2D bytes can be
-        the same as the those in the '&lt;!--' sequence.)</p>
-
-       </dd>
-
-       <dt>A sequence of bytes starting with: 0x3C, 0x4D or 0x6D, 0x45 or 0x65, 0x54 or 0x74, 0x41 or 0x61, and one of 0x09, 0x0A, 0x0C, 0x0D, 0x20, 0x2F (case-insensitive ASCII '&lt;meta' followed by a space or slash)</dt>
-       <dd>
-
-        <ol><li><p>Advance the <var title="">position</var> pointer so
-         that it points at the next 0x09, 0x0A, 0x0C, 0x0D, 0x20, or
-         0x2F byte (the one in sequence of characters matched
-         above).</li>
-
-         <li><p>Let <var title="">attribute list</var> be an empty
-         list of strings.</li> 
-
-         <li><p>Let <var title="">got pragma</var> be false.</li>
-
-         <li><p>Let <var title="">need pragma</var> be null.</li>
-
-         <li><p>Let <var title="">charset</var> be the null value
-         (which, for the purposes of this algorithm, is distinct from
-         an unrecognised encoding or the empty string).</li>
-
-         <li><p><i>Attributes</i>: <a href="#concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">Get an
-         attribute</a> and its value. If no attribute was sniffed,
-         then jump to the <i>processing</i> step below.</li>
-
-         <li><p>If the attribute's name is already in <var title="">attribute list</var>, then return to the step
-         labeled <i>attributes</i>.</p>
-
-         <li><p>Add the attribute's name to <var title="">attribute
-         list</var>.</p>
-
-         <li>
-
-          <p>Run the appropriate step from the following list, if one
-          applies:</p>
-
-          <dl class="switch"><dt>If the attribute's name is "<code title="">http-equiv</code>"</dt>
-
-           <dd><p>If the attribute's value is "<code title="">content-type</code>", then set <var title="">got
-           pragma</var> to true.</dd>
-
-           <dt>If the attribute's name is "<code title="">content</code>"</dt>
-
-           <dd><p>Apply the <a href="#algorithm-for-extracting-an-encoding-from-a-meta-element">algorithm for extracting an encoding
-           from a <code>meta</code> element</a>, giving the
-           attribute's value as the string to parse. If an encoding is
-           returned, and if <var title="">charset</var> is still set
-           to null, let <var title="">charset</var> be the encoding
-           returned, and set <var title="">need pragma</var> to
-           true.</dd>
-
-           <dt>If the attribute's name is "<code title="">charset</code>"</dt>
-
-           <dd><p>Let <var title="">charset</var> be the encoding
-           corresponding to the attribute's value, and set <var title="">need pragma</var> to false.</dd>
-
-          </dl></li>
-
-         <li><p>Return to the step labeled <i>attributes</i>.</li>
-
-         <li><p><i>Processing</i>: If <var title="">need pragma</var>
-         is null, then jump to the second step of the overall "two
-         step" algorithm.</li>
-
-         <li><p>If <var title="">need pragma</var> is true but <var title="">got pragma</var> is false, then jump to the second
-         step of the overall "two step" algorithm.</li>
-
-         <li><p>If <var title="">charset</var> is <a href="#a-utf-16-encoding">a UTF-16
-         encoding</a>, change the value of <var title="">charset</var> to UTF-8.</li>
-
-         <li><p>If <var title="">charset</var> is not a supported
-         character encoding, then jump to the second step of the
-         overall "two step" algorithm.</li>
-
-         <li><p>Return the encoding given by <var title="">charset</var>, with <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a>
-         <i>tentative</i>, and abort all these steps.</li>
-
-        </ol></dd>
-
-       <dt>A sequence of bytes starting with a 0x3C byte (ASCII &lt;), optionally a 0x2F byte (ASCII /), and finally a byte in the range 0x41-0x5A or 0x61-0x7A (an ASCII letter)</dt>
-       <dd>
-
-        <ol><li><p>Advance the <var title="">position</var> pointer so
-         that it points at the next 0x09 (ASCII TAB), 0x0A (ASCII LF),
-         0x0C (ASCII FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x3E
-         (ASCII &gt;) byte.</li>
-
-         <li><p>Repeatedly <a href="#concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">get an
-         attribute</a> until no further attributes can be found,
-         then jump to the second step in the overall "two step"
-         algorithm.</li>
-
-        </ol></dd>
-
-       <dt>A sequence of bytes starting with: 0x3C 0x21 (ASCII '&lt;!')</dt>
-       <dt>A sequence of bytes starting with: 0x3C 0x2F (ASCII '&lt;/')</dt>
-       <dt>A sequence of bytes starting with: 0x3C 0x3F (ASCII '&lt;?')</dt>
-       <dd>
-
-        <p>Advance the <var title="">position</var> pointer so that it
-        points at the first 0x3E byte (ASCII &gt;) that comes after the
-        0x3C byte that was found.</p>
-
-       </dd>
-
-       <dt>Any other byte</dt>
-       <dd>
-
-        <p>Do nothing with that byte.</p>
-
-       </dd>
-
-      </dl></li>
-
-     <li>Move <var title="">position</var> so it points at the next
-     byte in the input stream, and return to the first step of this
-     "two step" algorithm.</li>
-
-    </ol><p>When the above "two step" algorithm says to <dfn id="concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">get an
-    attribute</dfn>, it means doing this:</p>
-
-    <ol><li><p>If the byte at <var title="">position</var> is one of 0x09
-     (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII FF), 0x0D (ASCII CR),
-     0x20 (ASCII space), or 0x2F (ASCII /) then advance <var title="">position</var> to the next byte and redo this
-     substep.</li>
-
-     <li><p>If the byte at <var title="">position</var> is 0x3E (ASCII
-     &gt;), then abort the "get an attribute" algorithm. There isn't
-     one.</li>
-
-     <li><p>Otherwise, the byte at <var title="">position</var> is the
-     start of the attribute name. Let <var title="">attribute
-     name</var> and <var title="">attribute value</var> be the empty
-     string.</li>
-
-     <li><p><i>Attribute name</i>: Process the byte at <var title="">position</var> as follows:</p>
-
-      <dl class="switch"><dt>If it is 0x3D (ASCII =), and the <var title="">attribute
-       name</var> is longer than the empty string</dt>
-
-       <dd>Advance <var title="">position</var> to the next byte and
-       jump to the step below labeled <i>value</i>.</dd>
-
-       <dt>If it is 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII
-       FF), 0x0D (ASCII CR), or 0x20 (ASCII space)</dt>
-
-       <dd>Jump to the step below labeled <i>spaces</i>.</dd>
-
-       <dt>If it is 0x2F (ASCII /) or 0x3E (ASCII &gt;)</dt>
-
-       <dd>Abort the "get an attribute" algorithm. The attribute's
-       name is the value of <var title="">attribute name</var>, its
-       value is the empty string.</dd>
-
-       <dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII
-       Z)</dt>
-
-       <dd>Append the Unicode character with code point <span title=""><var title="">b</var>+0x20</span> to <var title="">attribute name</var> (where <var title="">b</var> is
-       the value of the byte at <var title="">position</var>). (This
-       converts the input to lowercase.)</dd>
-
-       <dt>Anything else</dt>
-
-       <dd>Append the Unicode character with the same code point as the
-       value of the byte at <var title="">position</var>) to <var title="">attribute name</var>. (It doesn't actually matter how
-       bytes outside the ASCII range are handled here, since only
-       ASCII characters can contribute to the detection of a character
-       encoding.)</dd>
-
-      </dl></li>
-
-     <li><p>Advance <var title="">position</var> to the next byte and
-     return to the previous step.</li>
-
-     <li><p><i>Spaces</i>: If the byte at <var title="">position</var> is one of 0x09 (ASCII TAB), 0x0A (ASCII
-     LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space) then
-     advance <var title="">position</var> to the next byte, then,
-     repeat this step.</li>
-
-     <li><p>If the byte at <var title="">position</var> is
-     <em>not</em> 0x3D (ASCII =), abort the "get an attribute"
-     algorithm. The attribute's name is the value of <var title="">attribute name</var>, its value is the empty
-     string.</li>
-
-     <li><p>Advance <var title="">position</var> past the 0x3D (ASCII
-     =) byte.</li>
-
-     <li><p><i>Value</i>: If the byte at <var title="">position</var> is one of 0x09 (ASCII TAB), 0x0A (ASCII
-     LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space) then
-     advance <var title="">position</var> to the next byte, then,
-     repeat this step.</li>
-
-     <li><p>Process the byte at <var title="">position</var> as
-     follows:</p>
-
-      <dl class="switch"><dt>If it is 0x22 (ASCII ") or 0x27 (ASCII ')</dt>
-
-       <dd>
-
-        <ol><li>Let <var title="">b</var> be the value of the byte at
-         <var title="">position</var>.</li>
-
-         <li>Advance <var title="">position</var> to the next
-         byte.</li>
-
-         <li>If the value of the byte at <var title="">position</var>
-         is the value of <var title="">b</var>, then advance <var title="">position</var> to the next byte and abort the "get
-         an attribute" algorithm. The attribute's name is the value of
-         <var title="">attribute name</var>, and its value is the
-         value of <var title="">attribute value</var>.</li>
-
-         <li>Otherwise, if the value of the byte at <var title="">position</var> is in the range 0x41 (ASCII A) to
-         0x5A (ASCII Z), then append a Unicode character to <var title="">attribute value</var> whose code point is 0x20 more
-         than the value of the byte at <var title="">position</var>.</li>
-
-         <li>Otherwise, append a Unicode character to <var title="">attribute value</var> whose code point is the same as
-         the value of the byte at <var title="">position</var>.</li>
-
-         <li>Return to the second step in these substeps.</li>
-
-        </ol></dd>
-
-       <dt>If it is 0x3E (ASCII &gt;)</dt>
-
-       <dd>Abort the "get an attribute" algorithm. The attribute's
-       name is the value of <var title="">attribute name</var>, its
-       value is the empty string.</dd>
-
-
-       <dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII
-       Z)</dt>
-
-       <dd>Append the Unicode character with code point <span title=""><var title="">b</var>+0x20</span> to <var title="">attribute
-       value</var> (where <var title="">b</var> is the value of the
-       byte at <var title="">position</var>). Advance <var title="">position</var> to the next byte.</dd>
-
-       <dt>Anything else</dt>
-
-       <dd>Append the Unicode character with the same code point as the
-       value of the byte at <var title="">position</var>) to <var title="">attribute value</var>. Advance <var title="">position</var> to the next byte.</dd>
-
-      </dl></li>
-
-     <li><p>Process the byte at <var title="">position</var> as
-     follows:</p>
-
-      <dl class="switch"><dt>If it is 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII
-       FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x3E (ASCII
-       &gt;)</dt>
-
-       <dd>Abort the "get an attribute" algorithm. The attribute's
-       name is the value of <var title="">attribute name</var> and its
-       value is the value of <var title="">attribute value</var>.</dd>
-
-       <dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII
-       Z)</dt>
-
-       <dd>Append the Unicode character with code point <span title=""><var title="">b</var>+0x20</span> to <var title="">attribute
-       value</var> (where <var title="">b</var> is the value of the
-       byte at <var title="">position</var>).</dd>
-
-       <dt>Anything else</dt>
-
-       <dd>Append the Unicode character with the same code point as the
-       value of the byte at <var title="">position</var>) to <var title="">attribute value</var>.</dd>
-
-      </dl></li>
+   <li>
 
-     <li><p>Advance <var title="">position</var> to the next byte and
-     return to the previous step.</li>
+    <p>Otherwise, optionally <a href="#prescan-a-byte-stream-to-determine-its-encoding" title="prescan a byte stream to
+    determine its encoding">prescan the byte stream to determine its
+    encoding</a>. The <var title="">end condition</var> is that the
+    user agent decides that scanning further bytes would not be
+    efficient. User agents are encouraged to only prescan the first
+    1024 bytes. User agents may decide that scanning <em>any</em>
+    bytes is not efficient, in which case these substeps are entirely
+    skipped.</p>
 
-    </ol><p>For the sake of interoperability, user agents should not use a
-    pre-scan algorithm that returns different results than the one
-    described above. (But, if you do, please at least let us know, so
-    that we can improve this algorithm and benefit everyone...)</p>
+    <p>The aforementioned algorithm either aborts unsuccessfully or
+    returns a character encoding. If it returns a character encoding,
+    then this algorithm must be aborted, returning the same encoding,
+    with <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a>
+    <i>tentative</i>.</p>
 
    </li>
 
-   <li><p>If the user agent has information on the likely encoding for
-   this page, e.g. based on the encoding of the page when it was last
-   visited, then return that encoding, with the <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a>
+   <li><p>Otherwise, if the user agent has information on the likely
+   encoding for this page, e.g. based on the encoding of the page when
+   it was last visited, then return that encoding, with the <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a>
    <i>tentative</i>, and abort these steps.</li>
 
    <li>
@@ -58749,6 +58462,314 @@
   as the user agent uses the returned value to select the decoder to
   use for the input stream.</p>
 
+  <hr><p>When an algorithm requires a user agent to <dfn id="prescan-a-byte-stream-to-determine-its-encoding">prescan a byte
+  stream to determine its encoding</dfn>, given some defined <var title="">end condition</var>, then it must run the following steps.
+  These steps either abort unsuccessfully or return a character
+  encoding.</p>
+
+  <ol><li>
+
+    <p>Let <var title="">position</var> be a pointer to a byte in the
+    input stream, initially pointing at the first byte. If at any
+    point during these steps the user agent either runs out of bytes
+    or reaches its <var title="">end condition</var>, then abort the
+    <a href="#prescan-a-byte-stream-to-determine-its-encoding">prescan a byte stream to determine its encoding</a>
+    algorithm unsuccessfully.</p>
+
+   </li>
+
+   <li>
+
+    <p><i>Loop</i>: If <var title="">position</var> points to:</p>
+
+    <dl class="switch"><dt>A sequence of bytes starting with: 0x3C 0x21 0x2D 0x2D (ASCII '&lt;!--')</dt>
+     <dd>
+
+      <p>Advance the <var title="">position</var> pointer so that it
+      points at the first 0x3E byte which is preceded by two 0x2D
+      bytes (i.e. at the end of an ASCII '--&gt;' sequence) and comes
+      after the 0x3C byte that was found. (The two 0x2D bytes can be
+      the same as those in the '&lt;!--' sequence.)</p>
+
+     </dd>
+
+     <dt>A sequence of bytes starting with: 0x3C, 0x4D or 0x6D, 0x45 or 0x65, 0x54 or 0x74, 0x41 or 0x61, and one of 0x09, 0x0A, 0x0C, 0x0D, 0x20, 0x2F (case-insensitive ASCII '&lt;meta' followed by a space or slash)</dt>
+     <dd>
+
+      <ol><li><p>Advance the <var title="">position</var> pointer so
+       that it points at the next 0x09, 0x0A, 0x0C, 0x0D, 0x20, or
+       0x2F byte (the one in sequence of characters matched
+       above).</li>
+
+       <li><p>Let <var title="">attribute list</var> be an empty
+       list of strings.</li> 
+
+       <li><p>Let <var title="">got pragma</var> be false.</li>
+
+       <li><p>Let <var title="">need pragma</var> be null.</li>
+
+       <li><p>Let <var title="">charset</var> be the null value
+       (which, for the purposes of this algorithm, is distinct from
+       an unrecognised encoding or the empty string).</li>
+
+       <li><p><i>Attributes</i>: <a href="#concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">Get an
+       attribute</a> and its value. If no attribute was sniffed,
+       then jump to the <i>processing</i> step below.</li>
+
+       <li><p>If the attribute's name is already in <var title="">attribute list</var>, then return to the step
+       labeled <i>attributes</i>.</p>
+
+       <li><p>Add the attribute's name to <var title="">attribute
+       list</var>.</p>
+
+       <li>
+
+        <p>Run the appropriate step from the following list, if one
+        applies:</p>
+
+        <dl class="switch"><dt>If the attribute's name is "<code title="">http-equiv</code>"</dt>
+
+         <dd><p>If the attribute's value is "<code title="">content-type</code>", then set <var title="">got
+         pragma</var> to true.</dd>
+
+         <dt>If the attribute's name is "<code title="">content</code>"</dt>
+
+         <dd><p>Apply the <a href="#algorithm-for-extracting-an-encoding-from-a-meta-element">algorithm for extracting an encoding
+         from a <code>meta</code> element</a>, giving the
+         attribute's value as the string to parse. If an encoding is
+         returned, and if <var title="">charset</var> is still set
+         to null, let <var title="">charset</var> be the encoding
+         returned, and set <var title="">need pragma</var> to
+         true.</dd>
+
+         <dt>If the attribute's name is "<code title="">charset</code>"</dt>
+
+         <dd><p>Let <var title="">charset</var> be the encoding
+         corresponding to the attribute's value, and set <var title="">need pragma</var> to false.</dd>
+
+        </dl></li>
+
+       <li><p>Return to the step labeled <i>attributes</i>.</li>
+
+       <li><p><i>Processing</i>: If <var title="">need pragma</var> is
+       null, then jump to the step below labeled <i>next
+       byte</i>.</li>
+
+       <li><p>If <var title="">need pragma</var> is true but <var title="">got pragma</var> is false, then jump to the step below
+       labeled <i>next byte</i>.</li>
+
+       <li><p>If <var title="">charset</var> is <a href="#a-utf-16-encoding">a UTF-16
+       encoding</a>, change the value of <var title="">charset</var> to UTF-8.</li>
+
+       <li><p>If <var title="">charset</var> is not a supported
+       character encoding, then jump to the step below labeled <i>next
+       byte</i>.</li>
+
+       <li><p>Abort the <a href="#prescan-a-byte-stream-to-determine-its-encoding">prescan a byte stream to determine its
+       encoding</a> algorithm, returning the encoding given by <var title="">charset</var>.</li>
+
+      </ol></dd>
+
+     <dt>A sequence of bytes starting with a 0x3C byte (ASCII &lt;), optionally a 0x2F byte (ASCII /), and finally a byte in the range 0x41-0x5A or 0x61-0x7A (an ASCII letter)</dt>
+     <dd>
+
+      <ol><li><p>Advance the <var title="">position</var> pointer so
+       that it points at the next 0x09 (ASCII TAB), 0x0A (ASCII LF),
+       0x0C (ASCII FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x3E
+       (ASCII &gt;) byte.</li>
+
+       <li><p>Repeatedly <a href="#concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">get an
+       attribute</a> until no further attributes can be found, then
+       jump to the step below labeled <i>next byte</i>.</li>
+
+      </ol></dd>
+
+     <dt>A sequence of bytes starting with: 0x3C 0x21 (ASCII '&lt;!')</dt>
+     <dt>A sequence of bytes starting with: 0x3C 0x2F (ASCII '&lt;/')</dt>
+     <dt>A sequence of bytes starting with: 0x3C 0x3F (ASCII '&lt;?')</dt>
+     <dd>
+
+      <p>Advance the <var title="">position</var> pointer so that it
+      points at the first 0x3E byte (ASCII &gt;) that comes after the
+      0x3C byte that was found.</p>
+
+     </dd>
+
+     <dt>Any other byte</dt>
+     <dd>
+
+      <p>Do nothing with that byte.</p>
+
+     </dd>
+
+    </dl></li>
+
+   <li><i>Next byte</i>: Move <var title="">position</var> so it
+   points at the next byte in the input stream, and return to the step
+   above labeled <i>loop</i>.</li>
+
+  </ol><p>When the <a href="#prescan-a-byte-stream-to-determine-its-encoding">prescan a byte stream to determine its
+  encoding</a> algorithm says to <dfn id="concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">get an attribute</dfn>,
+  it means doing this:</p>
+
+  <ol><li><p>If the byte at <var title="">position</var> is one of 0x09
+   (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII FF), 0x0D (ASCII CR),
+   0x20 (ASCII space), or 0x2F (ASCII /) then advance <var title="">position</var> to the next byte and redo this
+   step.</li>
+
+   <li><p>If the byte at <var title="">position</var> is 0x3E (ASCII
+   &gt;), then abort the <a href="#concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">get an
+   attribute</a> algorithm. There isn't one.</li>
+
+   <li><p>Otherwise, the byte at <var title="">position</var> is the
+   start of the attribute name. Let <var title="">attribute name</var>
+   and <var title="">attribute value</var> be the empty
+   string.</li>
+
+   <li><p><i>Attribute name</i>: Process the byte at <var title="">position</var> as follows:</p>
+
+    <dl class="switch"><dt>If it is 0x3D (ASCII =), and the <var title="">attribute
+     name</var> is longer than the empty string</dt>
+
+     <dd>Advance <var title="">position</var> to the next byte and
+     jump to the step below labeled <i>value</i>.</dd>
+
+     <dt>If it is 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII
+     FF), 0x0D (ASCII CR), or 0x20 (ASCII space)</dt>
+
+     <dd>Jump to the step below labeled <i>spaces</i>.</dd>
+
+     <dt>If it is 0x2F (ASCII /) or 0x3E (ASCII &gt;)</dt>
+
+     <dd>Abort the <a href="#concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">get an
+     attribute</a> algorithm. The attribute's name is the value of
+     <var title="">attribute name</var>, its value is the empty
+     string.</dd>
+
+     <dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII
+     Z)</dt>
+
+     <dd>Append the Unicode character with code point <span title=""><var title="">b</var>+0x20</span> to <var title="">attribute name</var> (where <var title="">b</var> is
+     the value of the byte at <var title="">position</var>). (This
+     converts the input to lowercase.)</dd>
+
+     <dt>Anything else</dt>
+
+     <dd>Append the Unicode character with the same code point as the
+     value of the byte at <var title="">position</var>) to <var title="">attribute name</var>. (It doesn't actually matter how
+     bytes outside the ASCII range are handled here, since only
+     ASCII characters can contribute to the detection of a character
+     encoding.)</dd>
+
+    </dl></li>
+
+   <li><p>Advance <var title="">position</var> to the next byte and
+   return to the previous step.</li>
+
+   <li><p><i>Spaces</i>: If the byte at <var title="">position</var> is one of 0x09 (ASCII TAB), 0x0A (ASCII
+   LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space) then
+   advance <var title="">position</var> to the next byte, then,
+   repeat this step.</li>
+
+   <li><p>If the byte at <var title="">position</var> is <em>not</em>
+   0x3D (ASCII =), abort the <a href="#concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">get an
+   attribute</a> algorithm. The attribute's name is the value of
+   <var title="">attribute name</var>, its value is the empty
+   string.</li>
+
+   <li><p>Advance <var title="">position</var> past the 0x3D (ASCII
+   =) byte.</li>
+
+   <li><p><i>Value</i>: If the byte at <var title="">position</var> is one of 0x09 (ASCII TAB), 0x0A (ASCII
+   LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space) then
+   advance <var title="">position</var> to the next byte, then,
+   repeat this step.</li>
+
+   <li><p>Process the byte at <var title="">position</var> as
+   follows:</p>
+
+    <dl class="switch"><dt>If it is 0x22 (ASCII ") or 0x27 (ASCII ')</dt>
+
+     <dd>
+
+      <ol><li>Let <var title="">b</var> be the value of the byte at
+       <var title="">position</var>.</li>
+
+       <li><i>Quote loop</i>: Advance <var title="">position</var> to
+       the next byte.</li>
+
+       <li>If the value of the byte at <var title="">position</var> is
+       the value of <var title="">b</var>, then advance <var title="">position</var> to the next byte and abort the "get an
+       attribute" algorithm. The attribute's name is the value of <var title="">attribute name</var>, and its value is the value of
+       <var title="">attribute value</var>.</li>
+
+       <li>Otherwise, if the value of the byte at <var title="">position</var> is in the range 0x41 (ASCII A) to 0x5A
+       (ASCII Z), then append a Unicode character to <var title="">attribute value</var> whose code point is 0x20 more
+       than the value of the byte at <var title="">position</var>.</li>
+
+       <li>Otherwise, append a Unicode character to <var title="">attribute value</var> whose code point is the same as
+       the value of the byte at <var title="">position</var>.</li>
+
+       <li>Return to the step above labeled <i>quote loop</i>.</li>
+
+      </ol></dd>
+
+     <dt>If it is 0x3E (ASCII &gt;)</dt>
+
+     <dd>Abort the <a href="#concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">get an
+     attribute</a> algorithm. The attribute's name is the value of
+     <var title="">attribute name</var>, its value is the empty
+     string.</dd>
+
+
+     <dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII
+     Z)</dt>
+
+     <dd>Append the Unicode character with code point <span title=""><var title="">b</var>+0x20</span> to <var title="">attribute value</var> (where <var title="">b</var> is
+     the value of the byte at <var title="">position</var>). Advance
+     <var title="">position</var> to the next byte.</dd>
+
+     <dt>Anything else</dt>
+
+     <dd>Append the Unicode character with the same code point as the
+     value of the byte at <var title="">position</var>) to <var title="">attribute value</var>. Advance <var title="">position</var> to the next byte.</dd>
+
+    </dl></li>
+
+   <li><p>Process the byte at <var title="">position</var> as
+   follows:</p>
+
+    <dl class="switch"><dt>If it is 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII
+     FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x3E (ASCII
+     &gt;)</dt>
+
+     <dd>Abort the <a href="#concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">get an
+     attribute</a> algorithm. The attribute's name is the value of
+     <var title="">attribute name</var> and its value is the value of
+     <var title="">attribute value</var>.</dd>
+
+     <dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII Z)</dt>
+
+     <dd>Append the Unicode character with code point <span title=""><var title="">b</var>+0x20</span> to <var title="">attribute value</var> (where <var title="">b</var> is
+     the value of the byte at <var title="">position</var>).</dd>
+
+     <dt>Anything else</dt>
+
+     <dd>Append the Unicode character with the same code point as the
+     value of the byte at <var title="">position</var>) to <var title="">attribute value</var>.</dd>
+
+    </dl></li>
+
+   <li><p>Advance <var title="">position</var> to the next byte and
+   return to the previous step.</li>
+
+  </ol><p>For the sake of interoperability, user agents should not use a
+  pre-scan algorithm that returns different results than the one
+  described above. (But, if you do, please at least let us know, so
+  that we can improve this algorithm and benefit everyone...)</p>
+
+
+
 
 
   <h5 id="character-encodings-0"><span class="secno">8.2.2.2 </span>Character encodings</h5>
Received on Monday, 13 February 2012 21:07:24 UTC
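
The prescan and "get an attribute" steps factored out in the diff above can be sketched in Python as follows. This is an illustrative reading of the spec text, not code from the commit. The names prescan, get_attribute, extract_charset, SUPPORTED and limit are hypothetical. In particular, extract_charset is a simplified stand-in for the spec's "algorithm for extracting an encoding from a meta element" (which this diff does not show), SUPPORTED stands in for whatever encodings a user agent actually supports, and the end condition is assumed to be a fixed byte limit.

```python
import re

WS = {0x09, 0x0A, 0x0C, 0x0D, 0x20}  # TAB, LF, FF, CR, space
TAG = re.compile(rb"</?[A-Za-z]")    # '<', optional '/', then an ASCII letter
# Assumption: a tiny stand-in for the user agent's supported encodings.
SUPPORTED = {"utf-8", "windows-1252", "iso-8859-1", "shift_jis"}

def lower(b):
    """Turn one byte into a character, lowercasing ASCII A-Z."""
    return chr(b + 0x20) if 0x41 <= b <= 0x5A else chr(b)

def get_attribute(data, pos):
    """Return ((name, value) or None, new pos). IndexError means out of bytes."""
    while data[pos] in WS or data[pos] == 0x2F:  # skip whitespace and '/'
        pos += 1
    if data[pos] == 0x3E:  # '>': there is no attribute
        return None, pos
    name = value = ""
    while True:  # "attribute name"
        b = data[pos]
        if b == 0x3D and name:  # '=' after a non-empty name
            pos += 1
            break
        if b in WS:  # "spaces"
            while data[pos] in WS:
                pos += 1
            if data[pos] != 0x3D:
                return (name, ""), pos
            pos += 1
            break
        if b in (0x2F, 0x3E):
            return (name, ""), pos
        name += lower(b)
        pos += 1
    while data[pos] in WS:  # "value"
        pos += 1
    b = data[pos]
    if b in (0x22, 0x27):  # quoted value: read up to the matching quote
        while True:  # "quote loop"
            pos += 1
            if data[pos] == b:
                return (name, value), pos + 1
            value += lower(data[pos])
    if b == 0x3E:
        return (name, ""), pos
    while data[pos] not in WS and data[pos] != 0x3E:  # unquoted value
        value += lower(data[pos])
        pos += 1
    return (name, value), pos

def extract_charset(content):
    """Hypothetical simplification of the meta-element extraction algorithm."""
    m = re.search(r"""charset\s*=\s*["']?([^"';\s]+)""", content)
    return m.group(1) if m else None

def prescan(data, limit=1024):
    """Return a lowercased encoding label, or None on unsuccessful abort."""
    data = data[:limit]  # assumed end condition: first `limit` bytes only
    pos = 0
    try:
        while pos < len(data):  # "loop"
            if data.startswith(b"<!--", pos):
                # The '--' of '<!--' may double as the '--' of '-->'.
                pos = data.index(b"-->", pos + 2) + 2
            elif (data[pos:pos + 5].lower() == b"<meta"
                  and data[pos + 5] in WS | {0x2F}):
                pos += 5
                seen, got_pragma, need_pragma, charset = set(), False, None, None
                while True:  # "attributes"
                    attr, pos = get_attribute(data, pos)
                    if attr is None:
                        break
                    name, value = attr
                    if name in seen:
                        continue
                    seen.add(name)
                    if name == "http-equiv" and value == "content-type":
                        got_pragma = True
                    elif name == "content":
                        enc = extract_charset(value)
                        if enc is not None and charset is None:
                            charset, need_pragma = enc, True
                    elif name == "charset":
                        charset, need_pragma = value, False
                # "processing"
                if need_pragma is not None and (got_pragma or not need_pragma):
                    if charset in ("utf-16", "utf-16le", "utf-16be"):
                        charset = "utf-8"
                    if charset in SUPPORTED:
                        return charset
            elif TAG.match(data, pos):
                while data[pos] not in WS and data[pos] != 0x3E:
                    pos += 1
                attr = True
                while attr is not None:  # consume every attribute of the tag
                    attr, pos = get_attribute(data, pos)
            elif data[pos:pos + 2] in (b"<!", b"</", b"<?"):
                pos = data.index(b">", pos)
            pos += 1  # "next byte"
    except (IndexError, ValueError):
        pass  # ran out of bytes: abort unsuccessfully
    return None
```

A caller playing the role of the encoding sniffing algorithm would treat a non-None result as the encoding with confidence tentative, and fall through to its next step on None. Running out of bytes is modelled here with Python's IndexError and ValueError, which is a convenience of this sketch rather than anything the spec prescribes.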