html5/spec Overview.html,1.1117,1.1118 from Ian Hickson via cvs-syncmail on 2008-07-24 (public-html-commits@w3.org from July 2008)

From: Ian Hickson via cvs-syncmail <cvsmail@w3.org>
Date: Thu, 24 Jul 2008 03:07:03 +0000
To: public-html-commits@w3.org
Message-Id: <E1KLrAV-0006E9-Cb@lionel-hutz.w3.org>
Update of /sources/public/html5/spec
In directory hutz:/tmp/cvs-serv23748

Modified Files:
	Overview.html 
Log Message:
Make content-sniffing 'better': make the text/binary case actually work out what the binary data might be; make the unknown type case determine the text/plain cases as a first-class citizen instead of falling back on the text/binary algorithm; fix minor grammatical things. (whatwg r1927)
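
For readers skimming the diff below, the revised "text or binary" steps can be sketched roughly as follows. This is an illustrative reading of the new text, not part of the commit. The function name, the 512-byte cap, the BOM list and the exact-prefix matching (the spec table also carries a mask column, not modelled here) are all assumptions. Only rows marked "Safe" in the "security" column are included.

```python
# Illustrative sketch of the revised "text or binary" sniffing steps.
# All names here are hypothetical; the 512-byte cap is an assumption.

# "Binary data bytes": 0x00-0x08, 0x0B, 0x0E-0x1A, 0x1C-0x1F.
BINARY_DATA_BYTES = frozenset(
    list(range(0x00, 0x09)) + [0x0B]
    + list(range(0x0E, 0x1B)) + list(range(0x1C, 0x20))
)

# Byte order marks (assumed): UTF-16BE, UTF-16LE, UTF-8.
BOMS = (b"\xFE\xFF", b"\xFF\xFE", b"\xEF\xBB\xBF")

# Rows of the "unknown type" table whose security cell says "Safe".
# Rows marked "Scriptable" (text/html, application/pdf) or "n/a" are
# deliberately left out, so this step can never return a scriptable type.
SAFE_PATTERNS = (
    (b"%!PS-Adobe-", "application/postscript"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xFF\xD8\xFF", "image/jpeg"),
    (b"BM", "image/bmp"),
    (b"\x00\x00\x01\x00", "image/vnd.microsoft.icon"),
)


def sniff_text_or_binary(data: bytes, limit: int = 512) -> str:
    """Return a sniffed type for a resource labelled as text/plain."""
    head = data[:limit]

    # Step 1: a BOM followed by at least one more byte means text.
    if len(head) >= 4 and head.startswith(BOMS):
        return "text/plain"

    # Step 2: no binary data bytes at all means text.
    if not any(byte in BINARY_DATA_BYTES for byte in head):
        return "text/plain"

    # Step 3: try the safe (non-scriptable) signatures only.
    for pattern, sniffed_type in SAFE_PATTERNS:
        if head.startswith(pattern):
            return sniffed_type

    # Step 4: binary, but nothing recognisable.
    return "application/octet-stream"
```

Under these assumptions, sniff_text_or_binary(b"<html>\x00") yields "application/octet-stream" rather than "text/html", which is the privilege-escalation guard the new warning paragraph describes.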

Index: Overview.html
===================================================================
RCS file: /sources/public/html5/spec/Overview.html,v
retrieving revision 1.1117
retrieving revision 1.1118
diff -u -d -r1.1117 -r1.1118
--- Overview.html	24 Jul 2008 02:28:55 -0000	1.1117
+++ Overview.html	24 Jul 2008 03:07:00 -0000	1.1118
@@ -6175,8 +6175,8 @@
      of bytes already available.
 
    <li>
-    <p>If <var title="">n</var> is 4 or more, and the first bytes of the file
-     match one of the following byte sets:</p>
+    <p>If <var title="">n</var> is 4 or more, and the first bytes of the
+     resource match one of the following byte sets:</p>
 
     <table>
      <thead>
@@ -6222,36 +6222,49 @@
         
     </table>
 
-    <p>...then the sniffed type of the resource is "text/plain".</p>
+    <p>...then the sniffed type of the resource is "text/plain". Abort these
+     steps.</p>
 
    <li>
-    <p>Otherwise, if any of the first <var title="">n</var> bytes of the
-     resource are in one of the following byte ranges:</p>
-    <!-- This byte list is based on RFC 2046 Section 4.1.2. Characters
-    in the range 0x00-0x1F, with the exception of 0x09, 0x0A, 0x0C,
-    0x0D (ASCII for TAB, LF, FF, and CR), and character 0x1B
-    (reportedly used by some encodings as a shift escape), are
-    invalid. Thus, if we see them, we assume it's not text. -->
-    
-    <ul class=brief>
-     <li> 0x00 - 0x08
+    <p>If none of the first <var title="">n</var> bytes of the resource are
+     <a href="#binary">binary data bytes</a> then the sniffed type of the
+     resource is "text/plain". Abort these steps.
 
-     <li> 0x0B
+   <li>
+    <p>If the first bytes of the resource match one of the byte sequences in
+     the "pattern" column of the table in the <i title="content-type
+     sniffing: unknown type"><a href="#content-type7">unknown type</a></i>
+     section below, ignoring any rows whose cell in the "security" column
+     says "scriptable" (or "n/a"), then the sniffed type of the resource is
+     the type given in the corresponding cell in the "sniffed type" column on
+     that row; abort these steps.</p>
 
-     <li> 0x0E - 0x1A
+    <p class=warning>It is critical that this step not ever return a
+     scriptable type (e.g. text/html), as otherwise that would allow a
+     privilege escalation attack.</p>
 
-     <li> 0x1C - 0x1F
-    </ul>
+   <li>
+    <p>Otherwise, the sniffed type of the resource is
+     "application/octet-stream".
+  </ol>
 
-    <p>...then the sniffed type of the resource is
-     "application/octet-stream".</p>
+  <p>Bytes covered by the following ranges are <dfn id=binary>binary data
+   bytes</dfn>:</p>
+  <!-- This byte list is based on RFC 2046 Section 4.1.2. Characters
+  in the range 0x00-0x1F, with the exception of 0x09, 0x0A, 0x0C, 0x0D
+  (ASCII for TAB, LF, FF, and CR), and character 0x1B (reportedly used
+  by some encodings as a shift escape), are invalid. Thus, if we see
+  them, we assume it's not text. -->
 
-    <p class=big-issue>maybe we should invoke the "Content-Type sniffing:
-     image" section now, falling back on "application/octet-stream".</p>
+  <ul class=brief>
+   <li> 0x00 - 0x08
 
-   <li>
-    <p>Otherwise, the sniffed type of the resource is "text/plain".
-  </ol>
+   <li> 0x0B
+
+   <li> 0x0E - 0x1A
+
+   <li> 0x1C - 0x1F
+  </ul>
 
   <h4 id=content-type2><span class=secno>2.7.4 </span><dfn
    id=content-type7>Content-Type sniffing: unknown type</dfn></h4>
@@ -6359,11 +6372,17 @@
     </dl>
 
    <li>
-    <p>As a last-ditch effort, jump to the <a href="#content-type6"
-     title="content-type sniffing: text or binary">text or binary</a>
-     section.
+    <p>If none of the first <var title="">n</var> bytes of the resource are
+     <a href="#binary">binary data bytes</a> then the sniffed type of the
+     resource is "text/plain". Abort these steps.
+
+   <li>
+    <p>Otherwise, the sniffed type of the resource is
+     "application/octet-stream".
   </ol>
 
+  <p>The table used by the above algorithm is:
+
   <table>
    <thead>
     <tr>
@@ -6371,6 +6390,8 @@
 
      <th rowspan=2>Sniffed type
 
+     <th rowspan=2>Security
+
      <th rowspan=2>Comment
 
     <tr>
@@ -6387,6 +6408,8 @@
 
      <td>text/html
 
+     <td>Scriptable
+
      <td>The string "<code title="">&lt;!DOCTYPE HTML</code>" in US-ASCII or
       compatible encodings, case-insensitively.
 
@@ -6398,6 +6421,8 @@
 
      <td>text/html
 
+     <td>Scriptable
+
      <td>The string "<code title="">&lt;HTML</code>" in US-ASCII or
       compatible encodings, case-insensitively, possibly with leading spaces.
       
@@ -6410,6 +6435,8 @@
 
      <td>text/html
 
+     <td>Scriptable
+
      <td>The string "<code title="">&lt;HEAD</code>" in US-ASCII or
       compatible encodings, case-insensitively, possibly with leading spaces.
       
@@ -6422,6 +6449,8 @@
 
      <td>text/html
 
+     <td>Scriptable
+
      <td>The string "<code title="">&lt;SCRIPT</code>" in US-ASCII or
       compatible encodings, case-insensitively, possibly with leading spaces.
       
@@ -6435,6 +6464,8 @@
 
      <td>application/pdf
 
+     <td>Scriptable
+
      <td>The string "<code title="">%PDF-</code>", the PDF signature.
 
     <tr>
@@ -6446,8 +6477,45 @@
 
      <td>application/postscript
 
+     <td>Safe
+
      <td>The string "<code title="">%!PS-Adobe-</code>", the PostScript
-      signature. <!-- copied from the section below -->
+      signature. <!-- copied from the text or binary section above -->
+
+   <tbody>
+    <tr>
+     <td>FF FF 00 00
+
+     <td>FE FF 00 00
+
+     <td>text/plain
+
+     <td>n/a
+
+     <td>UTF-16BE BOM <!-- followed by at least one character -->
+
+    <tr>
+     <td>FF FF 00 00
+
+     <td>FF FE 00 00
+
+     <td>text/plain
+
+     <td>n/a
+
+     <td>UTF-16LE BOM <!-- followed by at least one character -->
+
+    <tr>
+     <td>FF FF FF 00
+
+     <td>EF BB BF 00
+
+     <td>text/plain
+
+     <td>n/a
+
+     <td>UTF-8 BOM <!-- followed by at least one character -->
+      <!-- based on the table in the image section below -->
 
    <tbody>
     <tr>
@@ -6457,6 +6525,8 @@
 
      <td>image/gif
 
+     <td>Safe
+
      <td>The string "<code title="">GIF87a</code>", a GIF signature.
 
     <tr>
@@ -6466,6 +6536,8 @@
 
      <td>image/gif
 
+     <td>Safe
+
      <td>The string "<code title="">GIF89a</code>", a GIF signature.
 
     <tr>
@@ -6476,6 +6548,8 @@
 
      <td>image/png
 
+     <td>Safe
+
      <td>The PNG signature.
 
     <tr>
@@ -6486,6 +6560,8 @@
 
      <td>image/jpeg
 
+     <td>Safe
+
      <td>A JPEG SOI marker followed by the first byte of another marker.
 
     <tr>
@@ -6495,6 +6571,8 @@
 
      <td>image/bmp
 
+     <td>Safe
+
      <td>The string "<code title="">BM</code>", a BMP signature.
 
     <tr>
@@ -6504,10 +6582,15 @@
 
      <td>image/vnd.microsoft.icon
 
+     <td>Safe
+
      <td>A 0 word followed by a 1 word, a Windows Icon file format
       signature.
   </table>
 
+  <p class=big-issue>I'd like to add types like MPEG, AVI, Flash, Java, etc,
+   to the above table.
+
   <p>User agents may support further types if desired, by implicitly adding
    to the above table. However, user agents should not use any other patterns
    for types already mentioned in the table above, as this could then be used
@@ -6515,11 +6598,15 @@
    determine that content is not HTML and thus safe from XSS attacks, but
    then a user agent detects it as HTML anyway and allows script to execute).
 
+  <p>The column marked "security" is used by the algorithm in the "text or
+   binary" section, to avoid sniffing <code title="">text/plain</code>
+   content as a type that can be used for a privilege escalation attack.
+
   <h4 id=content-type3><span class=secno>2.7.5 </span><dfn
    id=content-type8>Content-Type sniffing: image</dfn></h4>
 
-  <p>If the first bytes of the file match one of the byte sequences in the
-   first columns of the following table, then the sniffed type of the
+  <p>If the first bytes of the resource match one of the byte sequences in
+   the first column of the following table, then the sniffed type of the
    resource is the type given in the corresponding cell in the second column
    on the same row:
Received on Thursday, 24 July 2008 03:07:38 UTC