[whatwg] Encoding sniffing algorithm - update proposal from Leif Halvard Silli on 2012-07-26 (public-whatwg-archive@w3.org from July 2012)

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Fri, 27 Jul 2012 01:27:48 +0300
To: whatwg@lists.whatwg.org
Message-ID: <20120727012748868673.2825876d@xn--mlform-iua.no>
I have just written a document on how implementations prioritize 
encoding info for HTML documents.[1] (As that document shows, I have 
not tested Safari 6.) Based on my findings there, I would like to 
suggest that the spec's encoding sniffing algorithm should be updated 
to look as follows:

Revised encoding sniffing algorithm proposal:

NEW! 0. document is XML format - opt out of the algorithm.
        [This step is already implicit in the spec, but it would
        make sense to include it explicitly, so that one could
        e.g. write test cases to check that this step is
        implemented. Currently Safari, Chrome and Opera do 
        not fully implement this step.]
         
NEW! #. Alternative: The BOM signature could go here instead of 
        in step 4. There is a bug about moving the BOM here and
        making it override everything else. What speaks against this:
          a) that Firefox, IE10 and Opera do not currently have
             this behavior.
          b) this revision of the sniffing algorithm, especially
             the revision in step 6 (required UTF-8 detection),
             might make the BOM-trumps-everything-else override
             less necessary
        What speaks for this override:
          a) Safari, Chrome and legacy IE implement it.
          b) some legacy content may depend on it

     1. user override.
        (PS: The spec should clarify whether user override is
             cacheable.)

NEW! 2. iframe inherits user override from parent browsing context
        [Currently not mentioned in the spec, even though "all"
         UAs have this step for HTML docs.]

     3. explicit charset attribute in Content-Type header.

     4. BOM signature [or as the second step, see above]

     5. native markup label <meta charset=UTF-8>

NEW! 6. UTF-8 detection.
        I think we should separate UTF-8 detection from other
        detection in order to make this step obligatory.
        The only new things here are the limitation to UTF-8
        detection and that it should be obligatory.
        (Thus: if the document is not detected as UTF-8, the
        parser proceeds to the next step in the algorithm.)
        This step would make browsers lean more strongly 
        towards UTF-8.

NEW! 7. parent browsing context default.
        The current spec does not mention this step at all,
        even though Opera, IE, Safari, Chrome and Firefox
        all implement it.

        Regarding steps 6 and 7, the order is important. Chrome,
        for instance, performs UTF-8 detection, but only /after/
        the parent browsing context default, whereas everyone
        else (Opera 12 by default, Firefox for some locales - I
        don't know if there are others) lets it happen before the
        'parent browsing context default'.

NEW! 8. info on “the likely encoding”
        The main novelty is that this step is placed _after_ 
        the (revised) UTF-8 detection and after the (new) parent
        browsing context default.
        The name 'the likely encoding' is from the current spec
        text. I am a bit uncertain about what it means in the 
        current spec, though, so I move here what I think makes
        sense. The steps under this point should perhaps be
        optional:

        a. detection of other charsets than UTF-8
           (e.g. the optional Cyrillic detection in
           Firefox or legacy Asian encoding detection.
           The actual detection might happen in step 6,
           but it should only be made to count here.)
        b. markup label of the sister language
           <?xml version="1.0" encoding="UTF-8"?>
           (Opera/WebKit/Chrome currently have this directly
           after the native encoding label step - step 5.)
        c. Other things? What does "likely encoding" currently
           refer to, exactly?

     9. locale default
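
To make the proposed order concrete, here is a minimal Python sketch of
steps 0-9. It is only an illustration, not spec text: the names Hints,
bom_encoding, looks_like_utf8 and sniff are hypothetical, the
"windows-1252" locale default is an assumed example value, and treating
pure-ASCII input as "not detected as UTF-8" is my own simplifying
assumption. The UTF-8 check also looks at the whole byte string at once,
which a real prescan over a truncated buffer could not do.

```python
import codecs
from dataclasses import dataclass
from typing import Optional

# BOM signatures for step 4 (UTF-8 and UTF-16 only, for brevity).
BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
)


@dataclass
class Hints:
    """Hypothetical container for the encoding info available to the parser."""
    is_xml: bool = False                         # step 0
    user_override: Optional[str] = None          # step 1
    parent_user_override: Optional[str] = None   # step 2
    http_charset: Optional[str] = None           # step 3
    meta_charset: Optional[str] = None           # step 5 (result of a prescan)
    parent_default: Optional[str] = None         # step 7
    likely_encoding: Optional[str] = None        # step 8
    locale_default: str = "windows-1252"         # step 9 (assumed example value)


def bom_encoding(data: bytes) -> Optional[str]:
    """Step 4: return the encoding named by a leading BOM, if any."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def looks_like_utf8(data: bytes) -> bool:
    """Step 6: True if data has non-ASCII bytes and is valid strict UTF-8."""
    if data.isascii():
        return False  # assumption: pure ASCII gives no evidence either way
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def sniff(data: bytes, h: Hints) -> Optional[str]:
    """Return the first encoding found, walking the steps in proposed order."""
    if h.is_xml:
        return None  # step 0: XML opts out of the algorithm
    candidates = (
        h.user_override,                              # step 1
        h.parent_user_override,                       # step 2
        h.http_charset,                               # step 3
        bom_encoding(data),                           # step 4
        h.meta_charset,                               # step 5
        "utf-8" if looks_like_utf8(data) else None,   # step 6
        h.parent_default,                             # step 7
        h.likely_encoding,                            # step 8
        h.locale_default,                             # step 9 (never None)
    )
    return next(c for c in candidates if c is not None)
```

In this sketch, swapping the step 6 and step 7 entries in candidates
would give the order described for Chrome above, and moving the
bom_encoding(data) entry to the front would give the 'BOM trumps
everything' alternative.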

[1] 
http://malform.no/blog/white-spots-in-html5-s-encoding-sniffing-algorithm

[2] As to the question of whether the BOM should trump everything else, 
I think it would be more important to get the other parts of 
this algorithm right. If we do get the rest of it right, then the 'BOM 
should trump' argument becomes less important.
-- 
Leif Halvard Silli
Received on Thursday, 26 July 2012 22:28:29 UTC