- From: Ian Hickson via cvs-syncmail <cvsmail@w3.org>
- Date: Mon, 13 Feb 2012 21:07:10 +0000
- To: public-html-commits@w3.org
Update of /sources/public/html5/spec
In directory hutz:/tmp/cvs-serv2558
Modified Files:
Overview.html
Log Message:
Factor out the prescan algorithm for reuse in other specs. (whatwg r6990)
Index: Overview.html
===================================================================
RCS file: /sources/public/html5/spec/Overview.html,v
retrieving revision 1.5582
retrieving revision 1.5583
diff -u -d -r1.5582 -r1.5583
--- Overview.html 11 Feb 2012 18:43:03 -0000 1.5582
+++ Overview.html 13 Feb 2012 21:07:05 -0000 1.5583
@@ -320,7 +320,7 @@
<h1>HTML5</h1>
<h2 class="no-num no-toc" id="a-vocabulary-and-associated-apis-for-html-and-xhtml">A vocabulary and associated APIs for HTML and XHTML</h2>
- <h2 class="no-num no-toc" id="editor-s-draft-11-february-2012">Editor's Draft 11 February 2012</h2>
+ <h2 class="no-num no-toc" id="editor-s-draft-13-february-2012">Editor's Draft 13 February 2012</h2>
<dl><dt>Latest Published Version:</dt>
<dd><a href="http://www.w3.org/TR/html5/">http://www.w3.org/TR/html5/</a></dd>
<dt>Latest Editor's Draft:</dt>
@@ -467,7 +467,7 @@
Group</a> is the W3C working group responsible for this
specification's progress along the W3C Recommendation
track.
- This specification is the 11 February 2012 Editor's Draft.
+ This specification is the 13 February 2012 Editor's Draft.
</p><!-- UNDER NO CIRCUMSTANCES IS THE PRECEDING PARAGRAPH TO BE REMOVED OR EDITED WITHOUT TALKING TO IAN FIRST --><p>Work on this specification is also done at the <a href="http://www.whatwg.org/">WHATWG</a>. The W3C HTML working group
actively pursues convergence with the WHATWG, as required by the <a href="http://www.w3.org/2007/03/HTML-WG-charter">W3C HTML working
group charter</a>.</p><!-- UNDER NO CIRCUMSTANCES IS THE FOLLOWING PARAGRAPH TO BE REMOVED OR EDITED WITHOUT TALKING TO IAN FIRST --><p>This document was produced by a group operating under the <a href="http://www.w3.org/Consortium/Patent-Policy-20040205/">5
@@ -58232,10 +58232,10 @@
parse of the document with the real encoding.</p>
<p id="documentEncoding">User agents must use the following
- algorithm (the <dfn id="encoding-sniffing-algorithm">encoding sniffing algorithm</dfn>) to determine
- the character encoding to use when decoding a document in the first
- pass. This algorithm takes as input any out-of-band metadata
- available to the user agent (e.g. the <a href="#content-type" title="Content-Type">Content-Type metadata</a> of the document)
+ algorithm, called the <dfn id="encoding-sniffing-algorithm">encoding sniffing algorithm</dfn>, to
+ determine the character encoding to use when decoding a document in
+ the first pass. This algorithm takes as input any out-of-band
+ metadata available to the user agent (e.g. the <a href="#content-type" title="Content-Type">Content-Type metadata</a> of the document)
and all the bytes available so far, and returns an encoding and a
<dfn id="concept-encoding-confidence" title="concept-encoding-confidence">confidence</dfn>. The
confidence is either <i>tentative</i>, <i>certain</i>, or
@@ -58271,9 +58271,9 @@
<p class="note">The authoring conformance requirements for
character encoding declarations limit them to only appearing <a href="#charset1024">in the first 1024 bytes</a>. User agents are
- therefore encouraged to use the preparse algorithm below (part of
- these steps) on the first 1024 bytes, but not to stall beyond
- that.</p>
+ therefore encouraged to use the prescan algorithm below (as
+ invoked by these steps) on the first 1024 bytes, but not to stall
+ beyond that.</p>
</li>
@@ -58298,315 +58298,28 @@
</table><p class="note">This step looks for Unicode Byte Order Marks
(BOMs).</li>
- <li><p>Otherwise, the user agent will have to search for explicit
- character encoding information in the file itself. This should
- proceed as follows:
-
- <p>Let <var title="">position</var> be a pointer to a byte in the
- input stream, initially pointing at the first byte. If at any
- point during these substeps the user agent either runs out of
- bytes or decides that scanning further bytes would not be
- efficient, then skip to the next step of the overall character
- encoding detection algorithm. User agents may decide that scanning
- <em>any</em> bytes is not efficient, in which case these substeps
- are entirely skipped.</p>
-
- <p>Now, repeat the following "two" steps until the algorithm
- aborts (either because user agent aborts, as described above, or
- because a character encoding is found):</p>
-
- <ol><li><p>If <var title="">position</var> points to:</p>
-
- <dl class="switch"><dt>A sequence of bytes starting with: 0x3C 0x21 0x2D 0x2D (ASCII '<!--')</dt>
- <dd>
-
- <p>Advance the <var title="">position</var> pointer so that it
- points at the first 0x3E byte which is preceded by two 0x2D
- bytes (i.e. at the end of an ASCII '-->' sequence) and comes
- after the 0x3C byte that was found. (The two 0x2D bytes can be
- the same as those in the '<!--' sequence.)</p>
-
- </dd>
-
- <dt>A sequence of bytes starting with: 0x3C, 0x4D or 0x6D, 0x45 or 0x65, 0x54 or 0x74, 0x41 or 0x61, and one of 0x09, 0x0A, 0x0C, 0x0D, 0x20, 0x2F (case-insensitive ASCII '<meta' followed by a space or slash)</dt>
- <dd>
-
- <ol><li><p>Advance the <var title="">position</var> pointer so
- that it points at the next 0x09, 0x0A, 0x0C, 0x0D, 0x20, or
- 0x2F byte (the one in the sequence of characters matched
- above).</li>
-
- <li><p>Let <var title="">attribute list</var> be an empty
- list of strings.</li>
-
- <li><p>Let <var title="">got pragma</var> be false.</li>
-
- <li><p>Let <var title="">need pragma</var> be null.</li>
-
- <li><p>Let <var title="">charset</var> be the null value
- (which, for the purposes of this algorithm, is distinct from
- an unrecognised encoding or the empty string).</li>
-
- <li><p><i>Attributes</i>: <a href="#concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">Get an
- attribute</a> and its value. If no attribute was sniffed,
- then jump to the <i>processing</i> step below.</li>
-
- <li><p>If the attribute's name is already in <var title="">attribute list</var>, then return to the step
- labeled <i>attributes</i>.</p>
-
- <li><p>Add the attribute's name to <var title="">attribute
- list</var>.</p>
-
- <li>
-
- <p>Run the appropriate step from the following list, if one
- applies:</p>
-
- <dl class="switch"><dt>If the attribute's name is "<code title="">http-equiv</code>"</dt>
-
- <dd><p>If the attribute's value is "<code title="">content-type</code>", then set <var title="">got
- pragma</var> to true.</dd>
-
- <dt>If the attribute's name is "<code title="">content</code>"</dt>
-
- <dd><p>Apply the <a href="#algorithm-for-extracting-an-encoding-from-a-meta-element">algorithm for extracting an encoding
- from a <code>meta</code> element</a>, giving the
- attribute's value as the string to parse. If an encoding is
- returned, and if <var title="">charset</var> is still set
- to null, let <var title="">charset</var> be the encoding
- returned, and set <var title="">need pragma</var> to
- true.</dd>
-
- <dt>If the attribute's name is "<code title="">charset</code>"</dt>
-
- <dd><p>Let <var title="">charset</var> be the encoding
- corresponding to the attribute's value, and set <var title="">need pragma</var> to false.</dd>
-
- </dl></li>
-
- <li><p>Return to the step labeled <i>attributes</i>.</li>
-
- <li><p><i>Processing</i>: If <var title="">need pragma</var>
- is null, then jump to the second step of the overall "two
- step" algorithm.</li>
-
- <li><p>If <var title="">need pragma</var> is true but <var title="">got pragma</var> is false, then jump to the second
- step of the overall "two step" algorithm.</li>
-
- <li><p>If <var title="">charset</var> is <a href="#a-utf-16-encoding">a UTF-16
- encoding</a>, change the value of <var title="">charset</var> to UTF-8.</li>
-
- <li><p>If <var title="">charset</var> is not a supported
- character encoding, then jump to the second step of the
- overall "two step" algorithm.</li>
-
- <li><p>Return the encoding given by <var title="">charset</var>, with <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a>
- <i>tentative</i>, and abort all these steps.</li>
-
- </ol></dd>
-
- <dt>A sequence of bytes starting with a 0x3C byte (ASCII <), optionally a 0x2F byte (ASCII /), and finally a byte in the range 0x41-0x5A or 0x61-0x7A (an ASCII letter)</dt>
- <dd>
-
- <ol><li><p>Advance the <var title="">position</var> pointer so
- that it points at the next 0x09 (ASCII TAB), 0x0A (ASCII LF),
- 0x0C (ASCII FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x3E
- (ASCII >) byte.</li>
-
- <li><p>Repeatedly <a href="#concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">get an
- attribute</a> until no further attributes can be found,
- then jump to the second step in the overall "two step"
- algorithm.</li>
-
- </ol></dd>
-
- <dt>A sequence of bytes starting with: 0x3C 0x21 (ASCII '<!')</dt>
- <dt>A sequence of bytes starting with: 0x3C 0x2F (ASCII '</')</dt>
- <dt>A sequence of bytes starting with: 0x3C 0x3F (ASCII '<?')</dt>
- <dd>
-
- <p>Advance the <var title="">position</var> pointer so that it
- points at the first 0x3E byte (ASCII >) that comes after the
- 0x3C byte that was found.</p>
-
- </dd>
-
- <dt>Any other byte</dt>
- <dd>
-
- <p>Do nothing with that byte.</p>
-
- </dd>
-
- </dl></li>
-
- <li>Move <var title="">position</var> so it points at the next
- byte in the input stream, and return to the first step of this
- "two step" algorithm.</li>
-
- </ol><p>When the above "two step" algorithm says to <dfn id="concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">get an
- attribute</dfn>, it means doing this:</p>
-
- <ol><li><p>If the byte at <var title="">position</var> is one of 0x09
- (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII FF), 0x0D (ASCII CR),
- 0x20 (ASCII space), or 0x2F (ASCII /) then advance <var title="">position</var> to the next byte and redo this
- substep.</li>
-
- <li><p>If the byte at <var title="">position</var> is 0x3E (ASCII
- >), then abort the "get an attribute" algorithm. There isn't
- one.</li>
-
- <li><p>Otherwise, the byte at <var title="">position</var> is the
- start of the attribute name. Let <var title="">attribute
- name</var> and <var title="">attribute value</var> be the empty
- string.</li>
-
- <li><p><i>Attribute name</i>: Process the byte at <var title="">position</var> as follows:</p>
-
- <dl class="switch"><dt>If it is 0x3D (ASCII =), and the <var title="">attribute
- name</var> is longer than the empty string</dt>
-
- <dd>Advance <var title="">position</var> to the next byte and
- jump to the step below labeled <i>value</i>.</dd>
-
- <dt>If it is 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII
- FF), 0x0D (ASCII CR), or 0x20 (ASCII space)</dt>
-
- <dd>Jump to the step below labeled <i>spaces</i>.</dd>
-
- <dt>If it is 0x2F (ASCII /) or 0x3E (ASCII >)</dt>
-
- <dd>Abort the "get an attribute" algorithm. The attribute's
- name is the value of <var title="">attribute name</var>, its
- value is the empty string.</dd>
-
- <dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII
- Z)</dt>
-
- <dd>Append the Unicode character with code point <span title=""><var title="">b</var>+0x20</span> to <var title="">attribute name</var> (where <var title="">b</var> is
- the value of the byte at <var title="">position</var>). (This
- converts the input to lowercase.)</dd>
-
- <dt>Anything else</dt>
-
- <dd>Append the Unicode character with the same code point as the
- value of the byte at <var title="">position</var> to <var title="">attribute name</var>. (It doesn't actually matter how
- bytes outside the ASCII range are handled here, since only
- ASCII characters can contribute to the detection of a character
- encoding.)</dd>
-
- </dl></li>
-
- <li><p>Advance <var title="">position</var> to the next byte and
- return to the previous step.</li>
-
- <li><p><i>Spaces</i>: If the byte at <var title="">position</var> is one of 0x09 (ASCII TAB), 0x0A (ASCII
- LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space) then
- advance <var title="">position</var> to the next byte, then,
- repeat this step.</li>
-
- <li><p>If the byte at <var title="">position</var> is
- <em>not</em> 0x3D (ASCII =), abort the "get an attribute"
- algorithm. The attribute's name is the value of <var title="">attribute name</var>, its value is the empty
- string.</li>
-
- <li><p>Advance <var title="">position</var> past the 0x3D (ASCII
- =) byte.</li>
-
- <li><p><i>Value</i>: If the byte at <var title="">position</var> is one of 0x09 (ASCII TAB), 0x0A (ASCII
- LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space) then
- advance <var title="">position</var> to the next byte, then,
- repeat this step.</li>
-
- <li><p>Process the byte at <var title="">position</var> as
- follows:</p>
-
- <dl class="switch"><dt>If it is 0x22 (ASCII ") or 0x27 (ASCII ')</dt>
-
- <dd>
-
- <ol><li>Let <var title="">b</var> be the value of the byte at
- <var title="">position</var>.</li>
-
- <li>Advance <var title="">position</var> to the next
- byte.</li>
-
- <li>If the value of the byte at <var title="">position</var>
- is the value of <var title="">b</var>, then advance <var title="">position</var> to the next byte and abort the "get
- an attribute" algorithm. The attribute's name is the value of
- <var title="">attribute name</var>, and its value is the
- value of <var title="">attribute value</var>.</li>
-
- <li>Otherwise, if the value of the byte at <var title="">position</var> is in the range 0x41 (ASCII A) to
- 0x5A (ASCII Z), then append a Unicode character to <var title="">attribute value</var> whose code point is 0x20 more
- than the value of the byte at <var title="">position</var>.</li>
-
- <li>Otherwise, append a Unicode character to <var title="">attribute value</var> whose code point is the same as
- the value of the byte at <var title="">position</var>.</li>
-
- <li>Return to the second step in these substeps.</li>
-
- </ol></dd>
-
- <dt>If it is 0x3E (ASCII >)</dt>
-
- <dd>Abort the "get an attribute" algorithm. The attribute's
- name is the value of <var title="">attribute name</var>, its
- value is the empty string.</dd>
-
-
- <dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII
- Z)</dt>
-
- <dd>Append the Unicode character with code point <span title=""><var title="">b</var>+0x20</span> to <var title="">attribute
- value</var> (where <var title="">b</var> is the value of the
- byte at <var title="">position</var>). Advance <var title="">position</var> to the next byte.</dd>
-
- <dt>Anything else</dt>
-
- <dd>Append the Unicode character with the same code point as the
- value of the byte at <var title="">position</var> to <var title="">attribute value</var>. Advance <var title="">position</var> to the next byte.</dd>
-
- </dl></li>
-
- <li><p>Process the byte at <var title="">position</var> as
- follows:</p>
-
- <dl class="switch"><dt>If it is 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII
- FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x3E (ASCII
- >)</dt>
-
- <dd>Abort the "get an attribute" algorithm. The attribute's
- name is the value of <var title="">attribute name</var> and its
- value is the value of <var title="">attribute value</var>.</dd>
-
- <dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII
- Z)</dt>
-
- <dd>Append the Unicode character with code point <span title=""><var title="">b</var>+0x20</span> to <var title="">attribute
- value</var> (where <var title="">b</var> is the value of the
- byte at <var title="">position</var>).</dd>
-
- <dt>Anything else</dt>
-
- <dd>Append the Unicode character with the same code point as the
- value of the byte at <var title="">position</var> to <var title="">attribute value</var>.</dd>
-
- </dl></li>
+ <li>
- <li><p>Advance <var title="">position</var> to the next byte and
- return to the previous step.</li>
+ <p>Otherwise, optionally <a href="#prescan-a-byte-stream-to-determine-its-encoding" title="prescan a byte stream to
+ determine its encoding">prescan the byte stream to determine its
+ encoding</a>. The <var title="">end condition</var> is that the
+ user agent decides that scanning further bytes would not be
+ efficient. User agents are encouraged to only prescan the first
+ 1024 bytes. User agents may decide that scanning <em>any</em>
+ bytes is not efficient, in which case these substeps are entirely
+ skipped.</p>
- </ol><p>For the sake of interoperability, user agents should not use a
- prescan algorithm that returns different results than the one
- described above. (But, if you do, please at least let us know, so
- that we can improve this algorithm and benefit everyone...)</p>
+ <p>The aforementioned algorithm either aborts unsuccessfully or
+ returns a character encoding. If it returns a character encoding,
+ then this algorithm must be aborted, returning the same encoding,
+ with <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a>
+ <i>tentative</i>.</p>
</li>
- <li><p>If the user agent has information on the likely encoding for
- this page, e.g. based on the encoding of the page when it was last
- visited, then return that encoding, with the <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a>
+ <li><p>Otherwise, if the user agent has information on the likely
+ encoding for this page, e.g. based on the encoding of the page when
+ it was last visited, then return that encoding, with the <a href="#concept-encoding-confidence" title="concept-encoding-confidence">confidence</a>
<i>tentative</i>, and abort these steps.</li>
<li>
@@ -58749,6 +58462,314 @@
as the user agent uses the returned value to select the decoder to
use for the input stream.</p>
+ <hr><p>When an algorithm requires a user agent to <dfn id="prescan-a-byte-stream-to-determine-its-encoding">prescan a byte
+ stream to determine its encoding</dfn>, given some defined <var title="">end condition</var>, then it must run the following steps.
+ These steps either abort unsuccessfully or return a character
+ encoding.</p>
+
+ <ol><li>
+
+ <p>Let <var title="">position</var> be a pointer to a byte in the
+ input stream, initially pointing at the first byte. If at any
+ point during these steps the user agent either runs out of bytes
+ or reaches its <var title="">end condition</var>, then abort the
+ <a href="#prescan-a-byte-stream-to-determine-its-encoding">prescan a byte stream to determine its encoding</a>
+ algorithm unsuccessfully.</p>
+
+ </li>
+
+ <li>
+
+ <p><i>Loop</i>: If <var title="">position</var> points to:</p>
+
+ <dl class="switch"><dt>A sequence of bytes starting with: 0x3C 0x21 0x2D 0x2D (ASCII '<!--')</dt>
+ <dd>
+
+ <p>Advance the <var title="">position</var> pointer so that it
+ points at the first 0x3E byte which is preceded by two 0x2D
+ bytes (i.e. at the end of an ASCII '-->' sequence) and comes
+ after the 0x3C byte that was found. (The two 0x2D bytes can be
+ the same as those in the '<!--' sequence.)</p>
+
+ </dd>
+
+ <dt>A sequence of bytes starting with: 0x3C, 0x4D or 0x6D, 0x45 or 0x65, 0x54 or 0x74, 0x41 or 0x61, and one of 0x09, 0x0A, 0x0C, 0x0D, 0x20, 0x2F (case-insensitive ASCII '<meta' followed by a space or slash)</dt>
+ <dd>
+
+ <ol><li><p>Advance the <var title="">position</var> pointer so
+ that it points at the next 0x09, 0x0A, 0x0C, 0x0D, 0x20, or
+ 0x2F byte (the one in the sequence of characters matched
+ above).</li>
+
+ <li><p>Let <var title="">attribute list</var> be an empty
+ list of strings.</li>
+
+ <li><p>Let <var title="">got pragma</var> be false.</li>
+
+ <li><p>Let <var title="">need pragma</var> be null.</li>
+
+ <li><p>Let <var title="">charset</var> be the null value
+ (which, for the purposes of this algorithm, is distinct from
+ an unrecognised encoding or the empty string).</li>
+
+ <li><p><i>Attributes</i>: <a href="#concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">Get an
+ attribute</a> and its value. If no attribute was sniffed,
+ then jump to the <i>processing</i> step below.</li>
+
+ <li><p>If the attribute's name is already in <var title="">attribute list</var>, then return to the step
+ labeled <i>attributes</i>.</p>
+
+ <li><p>Add the attribute's name to <var title="">attribute
+ list</var>.</p>
+
+ <li>
+
+ <p>Run the appropriate step from the following list, if one
+ applies:</p>
+
+ <dl class="switch"><dt>If the attribute's name is "<code title="">http-equiv</code>"</dt>
+
+ <dd><p>If the attribute's value is "<code title="">content-type</code>", then set <var title="">got
+ pragma</var> to true.</dd>
+
+ <dt>If the attribute's name is "<code title="">content</code>"</dt>
+
+ <dd><p>Apply the <a href="#algorithm-for-extracting-an-encoding-from-a-meta-element">algorithm for extracting an encoding
+ from a <code>meta</code> element</a>, giving the
+ attribute's value as the string to parse. If an encoding is
+ returned, and if <var title="">charset</var> is still set
+ to null, let <var title="">charset</var> be the encoding
+ returned, and set <var title="">need pragma</var> to
+ true.</dd>
+
+ <dt>If the attribute's name is "<code title="">charset</code>"</dt>
+
+ <dd><p>Let <var title="">charset</var> be the encoding
+ corresponding to the attribute's value, and set <var title="">need pragma</var> to false.</dd>
+
+ </dl></li>
+
+ <li><p>Return to the step labeled <i>attributes</i>.</li>
+
+ <li><p><i>Processing</i>: If <var title="">need pragma</var> is
+ null, then jump to the step below labeled <i>next
+ byte</i>.</li>
+
+ <li><p>If <var title="">need pragma</var> is true but <var title="">got pragma</var> is false, then jump to the step below
+ labeled <i>next byte</i>.</li>
+
+ <li><p>If <var title="">charset</var> is <a href="#a-utf-16-encoding">a UTF-16
+ encoding</a>, change the value of <var title="">charset</var> to UTF-8.</li>
+
+ <li><p>If <var title="">charset</var> is not a supported
+ character encoding, then jump to the step below labeled <i>next
+ byte</i>.</li>
+
+ <li><p>Abort the <a href="#prescan-a-byte-stream-to-determine-its-encoding">prescan a byte stream to determine its
+ encoding</a> algorithm, returning the encoding given by <var title="">charset</var>.</li>
+
+ </ol></dd>
+
+ <dt>A sequence of bytes starting with a 0x3C byte (ASCII <), optionally a 0x2F byte (ASCII /), and finally a byte in the range 0x41-0x5A or 0x61-0x7A (an ASCII letter)</dt>
+ <dd>
+
+ <ol><li><p>Advance the <var title="">position</var> pointer so
+ that it points at the next 0x09 (ASCII TAB), 0x0A (ASCII LF),
+ 0x0C (ASCII FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x3E
+ (ASCII >) byte.</li>
+
+ <li><p>Repeatedly <a href="#concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">get an
+ attribute</a> until no further attributes can be found, then
+ jump to the step below labeled <i>next byte</i>.</li>
+
+ </ol></dd>
+
+ <dt>A sequence of bytes starting with: 0x3C 0x21 (ASCII '<!')</dt>
+ <dt>A sequence of bytes starting with: 0x3C 0x2F (ASCII '</')</dt>
+ <dt>A sequence of bytes starting with: 0x3C 0x3F (ASCII '<?')</dt>
+ <dd>
+
+ <p>Advance the <var title="">position</var> pointer so that it
+ points at the first 0x3E byte (ASCII >) that comes after the
+ 0x3C byte that was found.</p>
+
+ </dd>
+
+ <dt>Any other byte</dt>
+ <dd>
+
+ <p>Do nothing with that byte.</p>
+
+ </dd>
+
+ </dl></li>
+
+ <li><i>Next byte</i>: Move <var title="">position</var> so it
+ points at the next byte in the input stream, and return to the step
+ above labeled <i>loop</i>.</li>
+
+ </ol><p>When the <a href="#prescan-a-byte-stream-to-determine-its-encoding">prescan a byte stream to determine its
+ encoding</a> algorithm says to <dfn id="concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">get an attribute</dfn>,
+ it means doing this:</p>
+
+ <ol><li><p>If the byte at <var title="">position</var> is one of 0x09
+ (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII FF), 0x0D (ASCII CR),
+ 0x20 (ASCII space), or 0x2F (ASCII /) then advance <var title="">position</var> to the next byte and redo this
+ step.</li>
+
+ <li><p>If the byte at <var title="">position</var> is 0x3E (ASCII
+ >), then abort the <a href="#concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">get an
+ attribute</a> algorithm. There isn't one.</li>
+
+ <li><p>Otherwise, the byte at <var title="">position</var> is the
+ start of the attribute name. Let <var title="">attribute name</var>
+ and <var title="">attribute value</var> be the empty
+ string.</li>
+
+ <li><p><i>Attribute name</i>: Process the byte at <var title="">position</var> as follows:</p>
+
+ <dl class="switch"><dt>If it is 0x3D (ASCII =), and the <var title="">attribute
+ name</var> is longer than the empty string</dt>
+
+ <dd>Advance <var title="">position</var> to the next byte and
+ jump to the step below labeled <i>value</i>.</dd>
+
+ <dt>If it is 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII
+ FF), 0x0D (ASCII CR), or 0x20 (ASCII space)</dt>
+
+ <dd>Jump to the step below labeled <i>spaces</i>.</dd>
+
+ <dt>If it is 0x2F (ASCII /) or 0x3E (ASCII >)</dt>
+
+ <dd>Abort the <a href="#concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">get an
+ attribute</a> algorithm. The attribute's name is the value of
+ <var title="">attribute name</var>, its value is the empty
+ string.</dd>
+
+ <dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII
+ Z)</dt>
+
+ <dd>Append the Unicode character with code point <span title=""><var title="">b</var>+0x20</span> to <var title="">attribute name</var> (where <var title="">b</var> is
+ the value of the byte at <var title="">position</var>). (This
+ converts the input to lowercase.)</dd>
+
+ <dt>Anything else</dt>
+
+ <dd>Append the Unicode character with the same code point as the
+ value of the byte at <var title="">position</var> to <var title="">attribute name</var>. (It doesn't actually matter how
+ bytes outside the ASCII range are handled here, since only
+ ASCII characters can contribute to the detection of a character
+ encoding.)</dd>
+
+ </dl></li>
+
+ <li><p>Advance <var title="">position</var> to the next byte and
+ return to the previous step.</li>
+
+ <li><p><i>Spaces</i>: If the byte at <var title="">position</var> is one of 0x09 (ASCII TAB), 0x0A (ASCII
+ LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space) then
+ advance <var title="">position</var> to the next byte, then,
+ repeat this step.</li>
+
+ <li><p>If the byte at <var title="">position</var> is <em>not</em>
+ 0x3D (ASCII =), abort the <a href="#concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">get an
+ attribute</a> algorithm. The attribute's name is the value of
+ <var title="">attribute name</var>, its value is the empty
+ string.</li>
+
+ <li><p>Advance <var title="">position</var> past the 0x3D (ASCII
+ =) byte.</li>
+
+ <li><p><i>Value</i>: If the byte at <var title="">position</var> is one of 0x09 (ASCII TAB), 0x0A (ASCII
+ LF), 0x0C (ASCII FF), 0x0D (ASCII CR), or 0x20 (ASCII space) then
+ advance <var title="">position</var> to the next byte, then,
+ repeat this step.</li>
+
+ <li><p>Process the byte at <var title="">position</var> as
+ follows:</p>
+
+ <dl class="switch"><dt>If it is 0x22 (ASCII ") or 0x27 (ASCII ')</dt>
+
+ <dd>
+
+ <ol><li>Let <var title="">b</var> be the value of the byte at
+ <var title="">position</var>.</li>
+
+ <li><i>Quote loop</i>: Advance <var title="">position</var> to
+ the next byte.</li>
+
+ <li>If the value of the byte at <var title="">position</var> is
+ the value of <var title="">b</var>, then advance <var title="">position</var> to the next byte and abort the "get an
+ attribute" algorithm. The attribute's name is the value of <var title="">attribute name</var>, and its value is the value of
+ <var title="">attribute value</var>.</li>
+
+ <li>Otherwise, if the value of the byte at <var title="">position</var> is in the range 0x41 (ASCII A) to 0x5A
+ (ASCII Z), then append a Unicode character to <var title="">attribute value</var> whose code point is 0x20 more
+ than the value of the byte at <var title="">position</var>.</li>
+
+ <li>Otherwise, append a Unicode character to <var title="">attribute value</var> whose code point is the same as
+ the value of the byte at <var title="">position</var>.</li>
+
+ <li>Return to the step above labeled <i>quote loop</i>.</li>
+
+ </ol></dd>
+
+ <dt>If it is 0x3E (ASCII >)</dt>
+
+ <dd>Abort the <a href="#concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">get an
+ attribute</a> algorithm. The attribute's name is the value of
+ <var title="">attribute name</var>, its value is the empty
+ string.</dd>
+
+
+ <dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII
+ Z)</dt>
+
+ <dd>Append the Unicode character with code point <span title=""><var title="">b</var>+0x20</span> to <var title="">attribute value</var> (where <var title="">b</var> is
+ the value of the byte at <var title="">position</var>). Advance
+ <var title="">position</var> to the next byte.</dd>
+
+ <dt>Anything else</dt>
+
+ <dd>Append the Unicode character with the same code point as the
+ value of the byte at <var title="">position</var> to <var title="">attribute value</var>. Advance <var title="">position</var> to the next byte.</dd>
+
+ </dl></li>
+
+ <li><p>Process the byte at <var title="">position</var> as
+ follows:</p>
+
+ <dl class="switch"><dt>If it is 0x09 (ASCII TAB), 0x0A (ASCII LF), 0x0C (ASCII
+ FF), 0x0D (ASCII CR), 0x20 (ASCII space), or 0x3E (ASCII
+ >)</dt>
+
+ <dd>Abort the <a href="#concept-get-attributes-when-sniffing" title="concept-get-attributes-when-sniffing">get an
+ attribute</a> algorithm. The attribute's name is the value of
+ <var title="">attribute name</var> and its value is the value of
+ <var title="">attribute value</var>.</dd>
+
+ <dt>If it is in the range 0x41 (ASCII A) to 0x5A (ASCII Z)</dt>
+
+ <dd>Append the Unicode character with code point <span title=""><var title="">b</var>+0x20</span> to <var title="">attribute value</var> (where <var title="">b</var> is
+ the value of the byte at <var title="">position</var>).</dd>
+
+ <dt>Anything else</dt>
+
+ <dd>Append the Unicode character with the same code point as the
+ value of the byte at <var title="">position</var> to <var title="">attribute value</var>.</dd>
+
+ </dl></li>
+
+ <li><p>Advance <var title="">position</var> to the next byte and
+ return to the previous step.</li>
+
+ </ol><p>For the sake of interoperability, user agents should not use a
+ prescan algorithm that returns different results than the one
+ described above. (But, if you do, please at least let us know, so
+ that we can improve this algorithm and benefit everyone...)</p>
+
+
+
<h5 id="character-encodings-0"><span class="secno">8.2.2.2 </span>Character encodings</h5>
Received on Monday, 13 February 2012 21:07:13 UTC
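
For readers following the factored-out steps, here is a minimal Python sketch of the "prescan a byte stream to determine its encoding" and "get an attribute" algorithms as worded in the added text. It is an illustration under stated assumptions, not part of the commit. The names prescan, get_attribute, _supported, CONTENT_CHARSET and the limit parameter are hypothetical. The "algorithm for extracting an encoding from a meta element" is not part of this diff, so a simplified regular expression stands in for it. Python's codecs.lookup stands in for "a supported character encoding", which accepts a different set of labels than a browser would. The end condition is modelled as a fixed byte limit (1024, the figure the text encourages).

```python
import codecs
import re

WS = b"\t\n\x0c\r "  # 0x09, 0x0A, 0x0C, 0x0D, 0x20
# Assumption: simplified stand-in for the spec's meta content extraction.
CONTENT_CHARSET = re.compile(r"charset\s*=\s*[\"']?([^\"'\s;]+)")

def _lower(b):
    """Map ASCII A-Z to a-z; leave any other byte value unchanged."""
    return b + 0x20 if 0x41 <= b <= 0x5A else b

def _supported(label):
    """Assumption: 'supported' means Python has a codec for the label."""
    try:
        codecs.lookup(label)
        return True
    except (LookupError, ValueError):
        return False

def get_attribute(data, pos):
    """Return (name, value, pos), or (None, None, pos) if there is no attribute.
    Reading past the end raises IndexError; the caller treats that as 'out of bytes'."""
    while data[pos] in WS + b"/":
        pos += 1
    if data[pos] == 0x3E:  # '>'
        return None, None, pos
    name, value = bytearray(), bytearray()
    while True:  # step labeled "Attribute name"
        b = data[pos]
        if b == 0x3D and name:  # '=' after a non-empty name
            pos += 1
            break
        if b in WS:  # step labeled "Spaces"
            while data[pos] in WS:
                pos += 1
            if data[pos] != 0x3D:
                return name.decode("latin-1"), "", pos
            pos += 1
            break
        if b in b"/>":
            return name.decode("latin-1"), "", pos
        name.append(_lower(b))
        pos += 1
    while data[pos] in WS:  # step labeled "Value"
        pos += 1
    b = data[pos]
    if b in b"\"'":
        quote = b
        while True:  # step labeled "Quote loop"
            pos += 1
            if data[pos] == quote:
                return name.decode("latin-1"), value.decode("latin-1"), pos + 1
            value.append(_lower(data[pos]))
    if b == 0x3E:
        return name.decode("latin-1"), "", pos
    while data[pos] not in WS + b">":  # unquoted value
        value.append(_lower(data[pos]))
        pos += 1
    return name.decode("latin-1"), value.decode("latin-1"), pos

def prescan(data, limit=1024):
    """Return a lowercase encoding label, or None if the prescan aborts unsuccessfully."""
    data, pos = data[:limit], 0
    try:
        while pos < len(data):  # step labeled "Loop"
            if data.startswith(b"<!--", pos):
                pos = data.index(b"-->", pos + 2) + 2
            elif data[pos:pos + 5].lower() == b"<meta" and data[pos + 5] in WS + b"/":
                pos += 5
                seen, got_pragma, need_pragma, charset = set(), False, None, None
                while True:  # step labeled "Attributes"
                    name, value, pos = get_attribute(data, pos)
                    if name is None:
                        break
                    if name in seen:
                        continue
                    seen.add(name)
                    if name == "http-equiv" and value == "content-type":
                        got_pragma = True
                    elif name == "content" and charset is None:
                        m = CONTENT_CHARSET.search(value)
                        if m:
                            charset, need_pragma = m.group(1), True
                    elif name == "charset":
                        charset, need_pragma = value, False
                # step labeled "Processing"
                if need_pragma is not None and (got_pragma or not need_pragma):
                    if charset in ("utf-16", "utf-16le", "utf-16be"):
                        charset = "utf-8"
                    if _supported(charset):
                        return charset
            elif re.match(rb"</?[A-Za-z]", data[pos:pos + 3]):
                while data[pos] not in WS + b">":
                    pos += 1
                while get_attribute(data, pos)[0] is not None:
                    pos = get_attribute(data, pos)[2]
            elif data[pos:pos + 2] in (b"<!", b"</", b"<?"):
                pos = data.index(b">", pos)
            pos += 1  # step labeled "Next byte"
    except (IndexError, ValueError):
        pass  # ran out of bytes
    return None
```

In this sketch a caller would treat a non-None result as the encoding to return with confidence tentative, matching the invoking step of the encoding sniffing algorithm; a None result means the caller falls through to its next step.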