White spots in HTML5's encoding sniffing algorithm

Maybe this is of interest to www-international@: today I 
published a report on how HTML parsers and XML parsers determine the 
character encoding.


The report concentrates on which encoding signals carry the most 
weight for browsers: User override vs BOM vs HTTP vs <meta> vs XML 
encoding declaration vs character detection vs language default vs 
locale default vs parent browsing context default. And perhaps some 
things I forgot. The data could be relevant in deciding a few issues.

Based on those data, I also filed 4 bugs against HTML5:

#1 Encoding Sniffing Algorithm:
   parent browsing context defines encoding default

   PROPOSAL: Add a new, second-to-last step, like so:
     #. If the document lives in a 'nested browsing context',
        then return the encoding of the 'parent browsing context',
        as a default encoding dictated by the parent browsing
        context, and abort these steps.
        [nested browsing context = iframe etc.]

#2 Encoding Sniffing Algorithm:
   Overrides apply to nested browsing contexts

   PROPOSAL: Add a new step after the current first step (about
             user overriding), like so:
     #. If the current document lives in the 'nested browsing
        context'[2] of a document in a 'parent browsing context'
        whose encoding has been overridden at the request of the
        user, then return the encoding of the parent browsing
        context, and abort these steps. 

#3 Encoding Sniffing Algorithm:
   Add an XML check as a step zero

   PROPOSAL: Add this step as a step zero:
     #. If the document is an XML document, abort these steps.
     [Purpose: to prevent the HTML encoding sniffing algorithm
      from (sometimes) being applied to XML.]
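
Read together, proposals #1 to #3 would order the checks roughly as in
the Python sketch below. This is an illustration only, not spec text:
the Context class, its fields, the sniff function and the windows-1252
fallback are all hypothetical, and the single `declared` field stands
in for the existing BOM, HTTP, <meta> and detection steps.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Context:
    """Hypothetical model of a browsing context (not an HTML5 interface)."""
    is_xml: bool = False
    user_override: Optional[str] = None  # encoding forced by the user
    declared: Optional[str] = None       # stand-in for BOM/HTTP/<meta>/detection
    encoding: Optional[str] = None       # encoding already in effect
    parent: Optional["Context"] = None   # set for iframes etc.


def sniff(ctx: Context, fallback: str = "windows-1252") -> Optional[str]:
    # Bug #3, step zero: XML documents leave the algorithm untouched.
    if ctx.is_xml:
        return None
    # Current first step: the user overrode this document's encoding.
    if ctx.user_override:
        return ctx.user_override
    # Bug #2: an override on the parent also applies to the nested
    # context; the parent's encoding is then the overriding one.
    if ctx.parent and ctx.parent.user_override:
        return ctx.parent.user_override
    # Existing middle steps, collapsed into one placeholder value.
    if ctx.declared:
        return ctx.declared
    # Bug #1, second-to-last step: inherit the parent's encoding as default.
    if ctx.parent and ctx.parent.encoding:
        return ctx.parent.encoding
    # Last step: some default (the value here is only an assumption).
    return fallback
```

With this ordering, an iframe that declares UTF-8 keeps UTF-8 unless
the user has overridden the encoding of the parent document, and an
iframe with no signals of its own inherits the parent's encoding.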

#4 Encoding Sniffing Algorithm:
   Clarify what "information on the likely encoding" covers

   * E.g., in an HTML document, is determining the encoding by
     reading the XML encoding declaration covered by this step?

Leif Halvard Silli

Received on Wednesday, 25 July 2012 14:15:34 UTC