Re: several messages about handling encodings in HTML from Ian Hickson on 2008-02-29 (public-html@w3.org from February 2008)

From: Ian Hickson <ian@hixie.ch>
Date: Fri, 29 Feb 2008 01:21:20 +0000 (UTC)
To: whatwg@whatwg.org, HTML WG <public-html@w3.org>, public-i18n-core@w3.org
Message-ID: <Pine.LNX.4.62.0802271006500.6407@hixie.dreamhostps.com>
Executive summary: I made a number of changes, as described below, in 
response to the feedback on character encodings in HTML. They are covered 
by revisions 1263 to 1275 of the spec source.

I have cc'ed most (though not all) of the mailing lists that were 
originally cc'ed on the messages to which I reply below, to keep everyone 
in the loop. Please, for everyone's sake, pick a single mailing list when 
replying, and trim the quotes to just the bits to which you are replying. 
Don't include the whole of this e-mail in your reply! Thanks.


On Sun, 5 Nov 2006, Øistein E. Andersen wrote, in reply to Henri:
> >
> > I think conforming text/html documents should not be allowed to parse 
> > into a DOM that contains characters that are not allowed in XML 1.0. 
> > [...] I am inclined to prefer [...] U+FFFD

(I've made the characters not allowed in XML also not allowed in HTML, 
with the exception of some of the space characters which we need to have 
allowed for legacy reasons.)


> I perfectly agree. (Actually, I think that U+7F (delete) and the C1 
> control characters should be excluded [transformed into U+FFFD] as well, 
> but this could perhaps be problematic due to spurious CP1252 
> characters.)

I've made them illegal but not converted them to FFFD.


On Mon, 6 Nov 2006, Lachlan Hunt wrote:
>
> At the very least, ISO-8859-1 must be treated as Windows-1252.  I'm not 
> sure about the other ISO-8859 encodings.  Numeric and hex character 
> references from 128 to 159 must also be treated as Windows-1252 code 
> points.

All already specified.
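
For illustration only (this is not spec text), here is a minimal Python 
sketch of the second half of that requirement: numeric character 
references from 128 to 159 mapped through Windows-1252. The function name 
is hypothetical, and the fallback for the few bytes that Python's cp1252 
codec leaves undefined is an assumption of the sketch.

```python
def resolve_numeric_reference(code_point: int) -> str:
    """Map a numeric character reference to a character.

    References in the C1 range (128-159) are reinterpreted as
    Windows-1252 bytes; everything else maps straight to Unicode.
    """
    if 0x80 <= code_point <= 0x9F:
        try:
            return bytes([code_point]).decode("cp1252")
        except UnicodeDecodeError:
            # Assumption of this sketch: bytes that cp1252 leaves
            # undefined (such as 0x81) fall back to the raw code point.
            return chr(code_point)
    return chr(code_point)
```

With this, &#147; resolves to U+201C (left double quotation mark) rather 
than to the C1 control character U+0093.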


On Sun, 5 Nov 2006, Elliotte Harold wrote:
>
> The specific problem is that an author may publish a correctly labeled 
> UTF-8 or ISO-8859-8 document or some such. However the server sends a 
> Content-type header that requires the parser to treat the document as 
> ISO-8859-1 or US-ASCII or something else.
>
> The need is for server administrators to allow content authors to 
> specify content types and character sets for the documents they write. 
> The content doesn't need to change. The authors just need the ability to 
> specify the server headers for their documents.

Well, we can't change the way this works from this side, so it's not 
really our problem at this point.


On Sat, 23 Dec 2006, Henri Sivonen wrote:
>
> http://www.elementary-group-standards.com/web-standards/html5-http-equiv-difference.html
>
> In short, some authors want to use <meta http-equiv="imagetoolbar" 
> content="no"> but (X)HTML5 doesn't allow it.
>
> Personally, I think that authors who want to disable *User* Agent 
> features like that are misguided.
>
> Anyway, I thought I'd mention this so that the issue gets informed as 
> opposed to accidental treatment.

Proprietary extensions to HTML are just that, proprietary extensions, and 
are therefore intentionally not conforming.


On Mon, 26 Feb 2007, Lachlan Hunt wrote:
>
> Given that the spec now says that ISO-8859-1 must be treated as 
> Windows-1252, should it still be considered an error to use the C1 
> control characters (U+0080 to U+009F) if ISO-8859-1 is declared?
>
> Some relevant messages from IRC:
>
> [15:59] <Lachy> since the spec says if ISO-8859-1 is declared, Windows-1252
> must be used. Is it still an error for authors to use the C1 control
> characters in the range 128-159?
> [16:23] <Hixie> Lachy: not sure what we should do, there's a bunch of corner
> cases there. like, should we allow control chars anyway, should we allow
> ISO-8859-1 to be declared but Win1252 to be used, etc.
> [16:23] <Hixie> Lachy: can you mail the list with suggestions and a list of
> the cases you can think of that we should cover?
> [16:27] <Lachy> I'm having a hard time deciding if it should be allowed or not
> [16:28] <Lachy> Technically, it is an error and I think users should be
> notified, but it's practically harmless these days and very common.
> [16:30] <Lachy> Yet, doing the same thing in XML doesn't work, since XML
> parsers do treat them as control characters

I've made it a parse error. I'm sure implementing this is going to be very 
exciting for Henri.


On Thu, 1 Mar 2007, Henri Sivonen wrote:
>
> I think that encoding information should be included in the HTTP 
> payload. In my opinion, the spec should not advise against this. 
> Preferably, it would encourage putting the encoding information in the 
> payload. (The BOM or, in the case of XML, the UTF-8 defaulting of the 
> XML sniffing algorithm are fine.)

I can't seem to find the part of the spec that recommends the opposite of 
this... did I already remove it? I'm happy to make the spec silent on this 
point, since experts disagree.


On Sun, 11 Mar 2007, Geoffrey Sneddon wrote:
>
> From implementing parts of the input stream (section 8.2.2 as of 
> writing) yesterday, I found several issues (some of which will show the 
> asshole[1] within me):
>
>  - Within step one of the get an attribute sub-algorithm it says
> "start over" -- is this start over the sub-algorithm or the whole algorithm?

Fixed.


>  - Again in step one, why do we need to skip whitespace in both the
> sub-algorithm and at section one of the inner step for <meta> tags?

Otherwise, the <meta bit would be pointing at the "<" and would treat 
"meta" as an attribute name.


>  - In step 11, when we have anything apart from a double/single quote
> or less/greater than sign, we add it to the value, but don't move the position
> forward, so when we move onto step 12 we add it again.

Yes, valid point. Fixed.


>  - In step 3 of the very inner set of steps for a content attribute in
> a <meta> tag, is charset case-sensitive?

Doesn't matter, the parser lowercases everything anyway.


>  - Again there, shouldn't we be given unicode codepoints for that (as
> it'll be a unicode string)?

Not sure what you mean.


On Sat, 26 May 2007, Henri Sivonen wrote:
>
> The draft says:
> "A leading U+FEFF BYTE ORDER MARK (BOM) must be dropped if present."
>
> That's reasonable for UTF-8 when the encoding has been established by 
> other means.
>
> However, when the encoding is UTF-16LE or UTF-16BE (i.e. supposed to be 
> signatureless), do we really want to drop the BOM silently? Shouldn't it 
> count as a character that is in error?

Do the UTF-16LE and UTF-16BE specs make a leading BOM an error?

If yes, then we don't have to say anything, it's already an error.

If not, what's the advantage of complaining about the BOM in this case?


> Likewise, if an encoding signature BOM has been discarded and the first 
> logical character of the stream is another BOM, shouldn't that also 
> count as a character that is in error?
>
> I think I should elaborate that when the encoding is UTF-16 (not 
> UTF-16LE or UTF-16BE), the BOM gets swallowed by the character 
> decoding layer (in reasonable decoder implementations) and is not 
> returned from the character stream at all. Therefore, on the character 
> level, a droppable BOM only occurs in UTF-8 when the encoding was 
> established by other means.

The spec says: "Given an encoding, the bytes in the input stream must be 
converted to Unicode characters for the tokeniser, as described by the 
rules for that encoding, except that leading U+FEFF BYTE ORDER MARK 
characters must not be stripped by the encoding layer."
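
The distinction Henri draws can be observed in one decoder 
implementation. The sketch below uses CPython's standard codecs purely as 
an example of a decoding layer; drop_leading_bom is a hypothetical helper 
standing in for the later "must be dropped" step, not anything defined by 
the spec.

```python
BOM = "\ufeff"


def drop_leading_bom(text: str) -> str:
    """Drop one leading U+FEFF, if present, after decoding."""
    return text[1:] if text.startswith(BOM) else text


utf8_bytes = b"\xef\xbb\xbf<p>"
utf16le_bytes = b"\xff\xfe" + "<p>".encode("utf-16-le")

# Signatureless decoders hand the BOM through as a U+FEFF character.
kept_utf8 = utf8_bytes.decode("utf-8")
kept_utf16le = utf16le_bytes.decode("utf-16-le")

# The generic "utf-16" decoder consumes the BOM to pick the byte order.
swallowed = utf16le_bytes.decode("utf-16")
```

Here kept_utf8 and kept_utf16le both start with U+FEFF, while swallowed 
does not, which is why a droppable BOM at the character level depends on 
how the encoding was established.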


On Mon, 28 May 2007, Henri Sivonen wrote:
>
> To this end, I think at least for conforming documents the algorithm for 
> establishing the character encoding should be deterministic. I'd like to 
> request two things:
>
> 1) When sniffing for meta charset, the current draft allows a user agent 
> to give up sooner than after examining the first 512 bytes. To make meta 
> charset sniffing reliable and deterministic so that it doesn't depend on 
> flukes in buffering, I think UAs should (if there's no transfer protocol 
> level charset label and no BOM) be required to consume bytes until they 
> find a meta charset, reach the EOF or have examined 512 bytes. That is, 
> I think UAs should not be allowed to give up earlier. (On the other 
> hand, I think UAs should be allowed to start examining the byte stream 
> before 512 have been buffered without an IO error, since in general, 
> byte stream buffer management should be up to the IO libraries and 
> outside the scope of the HTML spec.)

I don't want to do this because I don't want to require that browsers 
handle a CGI script that outputs 500 bytes then hangs for a minute in a 
way that doesn't render anything for a minute, and I don't want to require 
that people writing such CGI scripts front-load a 512 byte comment.

We've already conceded that a page can document.write() an encoding 
declaration after 6 megabytes of content and end up causing a reparse.


> 2) Since the chardet step is optional and the spec doesn't make the 
> Mozilla chardet behavior normative, I think the document should be 
> considered non-conforming if the algorithm for establishing the 
> character encoding proceeds to steps 6 (chardet) or 7 (last resort 
> default).

That would make most of my pages non-conforming. It would make this 
non-conforming:

   <!DOCTYPE HTML>
   <html>
    <head>
     <title> Example </title>
    </head>
    <body>
     <p> I don't want to be non-conforming! </p>
    </body>
   </html>


> It wouldn't hurt, though, to say in the section on writing documents that at
> least one of the following is required for document conformance:
>  * A transfer protocol-level character encoding declaration.
>  * A meta charset within the first 512 bytes.
>  * A BOM.

We already require that, though without the 512 byte requirement.
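
As a rough sketch of what checking that authoring requirement might look 
like (assumptions: the function name, the three BOM signatures listed, 
and a regular expression used as a crude stand-in for the real meta 
prescan; none of this is spec text):

```python
import re
from typing import Optional

# Byte order marks for UTF-8, UTF-16BE and UTF-16LE.
BOMS = (b"\xef\xbb\xbf", b"\xfe\xff", b"\xff\xfe")

# Crude approximation of "a meta element that declares a charset".
META_CHARSET = re.compile(rb"<meta[^>]*charset\s*=", re.IGNORECASE)


def has_encoding_declaration(body: bytes,
                             http_charset: Optional[str] = None) -> bool:
    """Return True if any of the three declaration mechanisms is present."""
    if http_charset:
        return True                      # transfer protocol-level label
    if body.startswith(BOMS):
        return True                      # BOM
    # No 512-byte limit here, matching the requirement as it stands.
    return META_CHARSET.search(body) is not None
```

A document with none of the three, served without a charset parameter, is 
the case that falls through to the heuristic and default steps.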


On Tue, 29 May 2007, Henri Sivonen wrote:
>
> To avoid stepping on the toes of Charmod more than is necessary, I 
> suggest making it non-conforming for a document to have bytes in the 
> 0x80–0x9F range when the character encoding is declared to be one of the 
> ISO-8859 family encodings.

Done, I believe.


> (UA conformance requires in some cases these bytes to be decoded in a 
> Charmod-violating way, but reality trumps Charmod for UA conformance. 
> While I'm at it: Surely there are other ISO-8859 family encodings 
> besides ISO-8859-1 that require decoding using the corresponding 
> windows-* family decoder?)

Maybe; anyone have any concrete information?


On Tue, 29 May 2007, Maciej Stachowiak wrote:
>
> I don't know of any ISO-8859 encodings requiring this, but for all 
> unicode encodings and numeric entity references compatibility requires 
> interpreting this range of code points in the WinLatin1 way.

On Mon, 4 Jun 2007, Henri Sivonen wrote:
>
> I tested with Firefox 2.0.4, Minefield, Safari 2.0.4, WebKit nightly and 
> Opera 9.20 (all on Mac). Only Safari 2.0.4 gives the DWIM treatment the 
> C1 code point range in UTF-8 and UTF-16.
>
> This makes me suspect that compatibility with the Web doesn't really 
> require the DWIM treatment here. What does IE7 do?
>
> The data I used: http://hsivonen.iki.fi/test/utf-c1/

IE7 and Safari 3 do the same as the other browsers, namely, no DWIM 
treatment.

So, I haven't changed the spec.


On Fri, 1 Jun 2007, Henri Sivonen wrote:
>
> The anomalies seem to be:
>  1) ISO-8859-1 is decoded as Windows-1252.
>  2) 0x85 in ISO-8859-10 and in ISO-8859-16 is decoded as in Windows-1252
> (ellipsis) by Gecko.
>  3) ISO-8859-11 is decoded as Windows-874.
>
> I was rather surprised by the results. They weren't at all what I expected.
> Test data: http://hsivonen.iki.fi/test/iso8859/
>
> I suggest adding the ISO-8859-11 to Windows-874 mapping to the spec.

On Fri, 1 Jun 2007, Henri Sivonen wrote:
>
> By Firefox and Opera. Safari doesn't support ISO-8859-11 and I was 
> unable to test IE.

On Fri, 1 Jun 2007, Simon Pieters wrote:
>
> IE7 and Opera handle ISO-8859-11.htm the same, AFAICT.

I did some studies and there appear to be enough pages as ISO-8859-11 to 
add this. I didn't check how many had bytes in the affected range, which 
maybe would be worth checking, though.
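
A minimal sketch of the two label overrides discussed here, using Python 
codec names ("cp1252", "cp874") as stand-ins for Windows-1252 and 
Windows-874. The table and function names are hypothetical, and labels 
are only lowercased, not trimmed.

```python
# Declared label -> decoder actually used (Python codec names).
OVERRIDES = {
    "iso-8859-1": "cp1252",   # anomaly 1: decode as Windows-1252
    "iso-8859-11": "cp874",   # anomaly 3: decode as Windows-874
}


def effective_decoder_name(label: str) -> str:
    """Return the decoder to use for a declared encoding label."""
    name = label.lower()
    return OVERRIDES.get(name, name)


# 0x85 is a C1 control in ISO-8859-11 but an ellipsis in Windows-874.
ellipsis = b"\x85".decode(effective_decoder_name("ISO-8859-11"))
```

Other ISO-8859 labels pass through unchanged, so ISO-8859-2 is still 
decoded as ISO-8859-2.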


On Sat, 2 Jun 2007, Øistein E. Andersen wrote:
>
> As suggested earlier [1], a simpler solution seems to be to treat C1 
> bytes and NCRs from /all/ ISO-8859-* and Unicode encodings as 
> Windows-1252.

That seems excessive.


On Tue, 5 Jun 2007, Henri Sivonen wrote:
> >
> > To avoid stepping on the toes of Charmod more than is necessary, I 
> > suggest making it non-conforming for a document to have bytes in the 
> > 0x80–0x9F range when the character encoding is declared to be one of 
> > the ISO-8859 family encodings.
>
> I've been thinking about this. I have a proposal on how to spec this 
> *conceptually* and how to implement this with error reporting. I am 
> assuming here that 1) No one ever intends C1 code points to be present 
> in the decoded stream and 2) we want, as a Charmod correctness fig leaf, 
> to make the C1 bytes non-conforming when ISO-8859-1 or ISO-8859-11 was 
> declared but Windows-1252 or Windows-874 decoding is needed.

I really don't care too much about the fig leaf part.


> Based on the behavior of Minefield and Opera 9.20, the following seems 
> to be the least Charmod violating and least quirky approach that could 
> possibly work:
>
> 1) Decode the byte stream using a decoder for whatever encoding was declared,
> even ISO-8859-1 or ISO-8859-11, according to ftp://
> ftp.unicode.org/Public/MAPPINGS/.
> 2) If a character in the decoded character stream is in the C1 code point
> range, this is a document conformance violation.
>    2a) If the declared encoding was ISO-8859-1, replace that character with
> the character that you get by casting the code point into a byte and decoding
> it as Windows-1252.
>    2b) If the declared encoding was ISO-8859-11, replace that character with
> the character that you get by casting the code point into a byte and decoding
> it as Windows-874.

That sounds far more complex than what we have now.


On Tue, 5 Jun 2007, Kristof Zelechovski wrote:
>
>     2c) If the declared encoding was ISO-8859-2, replace that character 
> with the character that you get by casting the code point into a byte 
> and decoding it as Windows-1250.

On Tue, 5 Jun 2007, Henri Sivonen wrote:
>
> As far as I can tell, that's not what Firefox, Minefield, Opera 9.20 and 
> WebKit nightlies do, so apparently it is not required for compatibility 
> with a notable number of pages.

Indeed.


On Tue, 5 Jun 2007, Maciej Stachowiak wrote:
>
> What we actually do in WebKit is always use a windows-1252 decoder when 
> ISO-8859-1 is requested. I don't think it's very helpful to make all 
> documents that declare a ISO-8859-1 encoding and use characters in the 
> C1 range nonconforming. It's true that they are counting on nonstandard 
> processing of the nominally declared encoding, but I don't think that 
> causes a problem in practice, as long as the rule is well known. It 
> seems simpler to just make latin1 an alias for winlatin1.

I agree.


On Fri, 1 Jun 2007, Raphael Champeimont (Almacha) wrote:
>
> I think there is something wrong in the "get an attribute" algorithm 
> from 8.2.2. The input stream.
>
> Between steps 11 and 12 I think there is a missing:
>
> 11b: Advance position to the next byte.
>
> With the current algorithm, if I write <meta charset = ascii> it will 
> say the value of attribute charset is "aascii" with one too much leading 
> A
>
> The reason is that in step 11 if we fall in case "Anything else" we add 
> the new char to the string, and then if we fall in "Anything else" in 
> step 12 we add again the *same* char to the string, so the first char of 
> the attribute value appears 2 times.

Fixed. (Though please check. I made several changes to this algorithm and 
would be happier if I knew someone had proofread the changes!)
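
To make the reported bug concrete, here is a simplified Python reader for 
an unquoted attribute value. It is not the spec's "get an attribute" 
algorithm: the terminator set, the function name and the 
advance_after_first flag are assumptions made for the illustration.

```python
from typing import Tuple

# Bytes that end an unquoted attribute value in this simplified reader.
TERMINATORS = b"\t\n\x0b\x0c\r <>"


def read_unquoted_value(data: bytes, position: int,
                        advance_after_first: bool = True) -> Tuple[str, int]:
    """Read an unquoted attribute value starting at `position`.

    With advance_after_first=False the reader reproduces the reported
    bug: the first byte of the value is appended twice.
    """
    value = [chr(data[position]).lower()]   # step 11, "anything else"
    if advance_after_first:
        position += 1                       # the proposed step 11b
    while position < len(data) and data[position] not in TERMINATORS:
        value.append(chr(data[position]).lower())  # step 12, "anything else"
        position += 1
    return "".join(value), position


tag = b"<meta charset = ascii>"
start = tag.index(b"ascii")
fixed_value, _ = read_unquoted_value(tag, start)
buggy_value, _ = read_unquoted_value(tag, start, advance_after_first=False)
```

Without the advance, buggy_value comes out as "aascii", matching 
Raphael's report; with it, fixed_value is "ascii".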


On Fri, 1 Jun 2007, Henri Sivonen wrote:
>
> In the charset meta sniffing algorithm under "Attribute name:":
>
> > If it is 0x2F (ASCII '/'), 0x3C (ASCII '<'), or 0x3E (ASCII '>')
> >     Stop looking for an attribute. The attribute's name is the value of
> > attribute name, its value is the empty string.
>
> In general, it seems to me the algorithm isn't quite clear on when to 
> stop looking for the current attribute and when to stop looking for 
> attributes for the current tag altogether.

The spec never distinguishes these two cases in the "get an attribute" 
algorithm -- the algorithm that invokes the "get an attribute" algorithm 
is the one that decides how often it is done.


> In this step, it seems to me that '/' should advance the pointer and end 
> getting the current attribute followed by getting another attribute. '>' 
> should end getting attributes on the whole tag without changing the 
> pointer.

It doesn't matter. Both return an attribute, then the invoking algorithm 
retries and if that results in no attribute (because you're on the ">") 
then you stop looking for the tag.


On Fri, 1 Jun 2007, Henri Sivonen wrote:
>
> The spec probably needs to be made more specific about the case where 
> the ASCII byte-based algorithm finds a supported encoding name but the 
> encoding is not a rough ASCII superset.
>
> 23:46 < othermaciej> one quirk in Safari is that if there's a meta tag
> claiming
>                      the source is utf-16, we treat it as utf-8
> ...
> 23:48 < othermaciej> hsivonen: there is content that needs it
> ...
> 23:52 < othermaciej> hsivonen: I think we may treat any claimed unicode
> charset
>                      in a <meta> tag as utf-8

Oops, I had this for the case where utf-16 was detected on the fly, but 
not for the preparser. Fixed.
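
One way to see why a UTF-16 claim found by the byte-based preparser is 
suspect: if the document really were UTF-16, an ASCII byte scan would not 
have found the tag at all. A small Python sketch, with a hypothetical 
function name:

```python
def encoding_from_meta(declared: str) -> str:
    """Apply the preparser rule: a claimed UTF-16 becomes UTF-8."""
    name = declared.lower()
    return "utf-8" if name == "utf-16" else name


tag = "<meta charset=utf-16>"
as_utf16 = tag.encode("utf-16-le")

# Every ASCII character is followed by a 0x00 byte in UTF-16LE, so the
# contiguous ASCII byte sequence "<meta" never appears.
found_by_ascii_scan = b"<meta" in as_utf16
```

Here found_by_ascii_scan is False, so a tag the scan did find must have 
been written in an ASCII-compatible encoding.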


On Sat, 2 Jun 2007, Philip Taylor wrote:
>
> 8.2.2. The input stream: "If the next six characters are not 'charset'" 
> - s/six/seven/

Fixed.


On Thu, 14 Jun 2007, Henri Sivonen wrote:
>
> As written, the charset sniffing algorithm doesn't trim space characters 
> from around the tentative encoding name. html5lib test cases expect the 
> space characters to be trimmed.
>
> I suggest trimming space characters (or anything <= 0x20 depending on 
> which approach is the right for compat).

Actually it seems browsers don't do any trimming here. I've added a 
comment to that effect.


On Sat, 23 Jun 2007, Øistein E. Andersen wrote:
> >> 
> >>> Bytes or sequences of bytes in the original byte stream that could 
> >>> not be converted to Unicode characters must be converted to U+FFFD 
> >>> REPLACEMENT CHARACTER code points.
> >> 
> >> [This does not specify the exact number of replacement characters.]
> >
> > I don't really know how to define this.
>
> Unicode 5.0 remains vague on this point. (E.g., definition D92 defines 
> well-formed and ill-formed UTF-8 byte sequences, but conformance 
> requirement C10 only requires ill-formed sequences to be treated as an 
> error condition and suggests that a one-byte ill-formed sequence may be 
> either filtered out or replaced by a U+FFFD replacement character.) More 
> generally, character encoding specifications can hardly be expected to 
> define proper error handling, since they are usually not terribly 
> preoccupied with mislabelled data.

They should define error handling, and are defective if they don't. 
However, I agree that many specs are defective. This is certainly not 
limited to character encoding specifications.


> The current text may nevertheless be too liberal. It would notably be 
> possible to construct an arbitrarily long Chinese text in a legacy 
> encoding which -- according to the spec -- could be replaced by one 
> single U+FFFD replacement character if incorrectly handled as UTF-8, 
> which might lead the user to think that the page is completely 
> uninteresting and therefore move on, whereas a larger number of 
> replacement characters would have led him to try another encoding. (This 
> is only a problem, of course, if an implementor chooses to emit the 
> minimal number of replacement characters sanctioned by the spec.)

Yes, but this is a user interface issue, not an interoperability issue, so 
I don't think we need to be concerned about it.
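
For anyone curious what one decoder does in practice, the following 
Python sketch decodes a short GBK-encoded Chinese string as UTF-8 and 
counts the replacement characters. The count is whatever CPython's UTF-8 
decoder happens to emit; it is one implementation's choice, not something 
the spec mandates.

```python
# "Chinese web page" in Chinese, written with escapes to keep the
# source ASCII-only, then encoded in a legacy encoding.
legacy_bytes = "\u4e2d\u6587\u7f51\u9875".encode("gbk")

# Mislabelled as UTF-8: undecodable bytes become U+FFFD.
decoded = legacy_bytes.decode("utf-8", errors="replace")
replacement_count = decoded.count("\ufffd")

print(len(legacy_bytes), "bytes ->", replacement_count, "replacement characters")
```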


On Thu, 2 Aug 2007, Henri Sivonen wrote:

> On Aug 2, 2007, at 10:11, Ian Hickson wrote:
>
> > Would a non-normative note help here? Something like:
> > 
> >    Note: Bytes or sequences of bytes in the original byte stream that did
> >    not conform to the encoding specification (e.g. invalid UTF-8 byte
> >    sequences in a UTF-8 input stream) are errors that conformance
> >    checkers are expected to report.
> > 
> > ...to be put after the paragraph that reads "Bytes or sequences of 
> > bytes in the original byte stream that could not be converted to 
> > Unicode characters must be converted to U+FFFD REPLACEMENT CHARACTER 
> > code points".
>
> Yes, this is what I meant with "a note hinting the consequences".

Ok, added.


> > (Note that not all bytes or sequences of bytes in the original byte 
> > stream that could not be converted to Unicode characters are 
> > necessarily errors. It could just be that the encoding has a character 
> > set that isn't a subset of Unicode, e.g. the Apple logo found in most 
> > Apple character sets doesn't have a non-PUA analogue in Unicode. Its 
> > presence in an HTML document isn't an error as far as I'm concerned.)
>
> Since XML and HTML5 are defined in terms of Unicode characters, there's 
> nowhere to go except error and REPLACEMENT CHARACTER or the PUA for 
> characters that aren't in Unicode. I'd steer clear of this in the spec 
> and let decoders choose between de facto PUA assignments (like U+F8FF for 
> the Apple logo) and errors.

Yeah, I don't have any intention of mentioning this in the spec.


On Wed, 31 Oct 2007, Martin Duerst wrote:
>
> [8.2.2.1]
>
> In point 3., it's not completely clear whether the encoding returned is 
> e.g. "UTF-16BE BOM" or "UTF-16BE". Probably the best thing editorially 
> is to move the word BOM from the description column of the table to the 
> text prior to the table.

Fixed.


> In point 7, what I find unnecessary is the repeated mention of heuristic 
> algorithms, which are already mentioned previously in point 6.

The heuristics in step 6 are for determining an encoding based on the byte 
stream, e.g. using frequency analysis. The heuristics in step 7 are for 
picking a default once that has failed. For example, if the defaults are 
UTF-8 or Win1252, then you can determine which to pick by simply deciding 
whether or not the stream is valid UTF-8.
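
The step 7 example can be sketched in a few lines of Python. The function 
name is hypothetical, and a real user agent would work on whatever prefix 
of the stream it has buffered rather than the whole document.

```python
def pick_default_encoding(data: bytes) -> str:
    """Choose between two defaults by testing UTF-8 validity."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "windows-1252"
    return "utf-8"
```

Pure ASCII input is valid UTF-8, so it yields "utf-8"; the two encodings 
agree on those bytes, so the choice makes no difference to the result.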


> (I'm really interested what document [UNIVCHADET] is going to point to.)

http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html

(It's in the source.)


> What I find missing/unclear is that the user can overwrite the page 
> encoding manually. What is mentioned is a user-specified default, which 
> makes sense (e.g. "well, I'm mostly viewing Chinese pages, so I set my 
> default to GB2312"). However, what we also need is the possibility for a 
> user to override the encoding of a specific page (not changing the 
> default). This is necessary because some pages are still mislabeled. 
> When such an override is present, it should come before what's currently 
> number 1.

User agents can provide user interfaces to override anything they want, 
e.g. they could provide an interface that changes all <script> elements 
into <pre> elements on the fly, or whatever. Such behaviour is outside the 
scope of the specification, since it is no longer about interoperability, 
but about user control. It's technically non-compliant, because it is 
doing something with the page that doesn't match what would happen for 
other people (unless they _also_ overrode the spec behaviour).


> In 8.2.2.2, what I find unnecessary is that encodings such as UTF-7 are 
> explicitly forbidden. I agree that these are virtually useless. However, 
> I don't think implementing them would create any harm, and I don't think 
> they should be dignified by even mentioning them.

Sadly they do cause harm. The ones that are outlawed have all been used in 
either actual attacks or proof-of-concept attacks described in 
vulnerability reports, mostly due to their deceptive similarity to more 
common encodings. (UTF-7 in particular has been used in a number of 
attacks, because IE supported auto-detecting it, if I recall correctly.)
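
A short Python illustration of the "deceptive similarity" problem with 
UTF-7: a byte string containing no "<" at all decodes to markup. The 
payload is a made-up example, and the set and function names are 
hypothetical; the four labels are the ones the draft forbids.

```python
# Encodings the draft says user agents must not support.
FORBIDDEN = {"cesu-8", "utf-7", "bocu-1", "scsu"}


def is_supported_label(label: str) -> bool:
    """Reject labels that name a forbidden encoding."""
    return label.lower() not in FORBIDDEN


# Looks like harmless ASCII to a filter that searches for "<".
payload = b"+ADw-script+AD4-"
decoded = payload.decode("utf-7")
```

Here decoded is the string "<script>", even though a naive filter 
scanning the raw bytes for angle brackets would have found nothing.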


> In 8.2.2.4, I have no idea what's the reason or purpose of point 1, 
> which reads "If the new encoding is UTF-16, change it to UTF-8.". I 
> suspect some misunderstanding.

This is required because many pages are labelled as UTF-16 but actually 
use UTF-8. For example:

  http://www.zingermans.com


> Well, now let's get back to CharMod, and to the place where I think you 
> need to do more work. HTML5 currently says "treat data labeled 
> iso-8859-1 as windows-1252". This conflicts with C025 of CharMod 
> (http://www.w3.org/TR/charmod/#C025):
>
> C025 [I] [C] An IANA-registered charset name MUST NOT be used to label 
> text data in a character encoding other than the one identified in the 
> IANA registration of that name.
>
> and also C030 (http://www.w3.org/TR/charmod/#C030): C030 [I] When an 
> IANA-registered charset name is recognized, receiving software MUST 
> interpret the received data according to the encoding associated with 
> the name in the IANA registry.
>
> So the following sentence:
>
> "When a user agent would otherwise use the ISO-8859-1 encoding, it must 
> instead use the Windows-1252 encoding."
>
> from HTML5 is clearly not conforming to CharMod.

Indeed, it says so explicitly in the spec.


> Please note that the above items (C025 and C030) say that they only 
> affect implementations ([I]) and content ([C]), but I think the main 
> reason for this is that we never even imagined that a spec would say 
> "you must treat FOO as BAR".
>
> I don't disagree with 'widely deployed', but I think one main reason for 
> this is that it took ages to get windows-1252 registered. I think there 
> are other ways to deal with this issue than a MUST. One thing that I 
> guess you could do is to just describe current practice.

Well, what we're describing is what an implementation has to do to be 
compatible with the other implementations. And right now, this is one of 
the things it has to do.


> This brings me to another point: The whole HTML5 spec seems to be 
> written with implementers, and implementers only, in mind. This is great 
> to help get browser behavior aligned, but it creates an enormous 
> problem: The majority of potential users of the spec, namely creators of 
> content, and of tools creating content, are completely left out. As an 
> example, trying to reverse-engineer how to indicate the character 
> encoding inside an HTML5 document from point 4 in 8.2.2.1 is completely 
> impossible for content creators, webmasters, and the like.

Section "8.2 Parsing HTML documents" is indeed exclusively for user agent 
implementors and conformance checker implementors. For authors and 
authoring tool implementors, you want section "8.1 Writing HTML documents" 
and section "3.7.5.4. Specifying the document's character encoding" (which 
is linked to from 8.1). These give the flipside of these requirements, the 
authoring side.


On Sat, 3 Nov 2007, Addison Phillips wrote:
>
> --
> Otherwise, return an implementation-defined or user-specified default 
> character encoding, with the confidence tentative. Due to its use in 
> legacy content, windows-1252 is recommended as a default in 
> predominantly Western demographics. In non-legacy environments, the more 
> comprehensive UTF-8 encoding is recommended instead. Since these 
> encodings can in many cases be distinguished by inspection, a user agent 
> may heuristically decide which to use as a default.
> --
>
> Our comment is that this is a pretty weak recommendation. It is 
> difficult to say what a "Western demographic" means in this context. We 
> think we know why this is here: untagged HTML4 documents have a default 
> character encoding of ISO 8859-1, so it is unsurprising to assume its 
> common superset encoding when no other encoding can be guessed.
>
> However, we would like to see several things happen here:
>
> 1. It never actually says anywhere why windows-1252 must be used instead 
> of ISO 8859-1.

This is required in "Preprocessing the input stream".


> 2. As quoted, it seems to (but does not actually) favor 1252 over UTF-8. 
> Since UTF-8 is highly detectable and also the best long-term general 
> default, we'd prefer if the emphasis were reversed, dropping the 
> reference to "Western demographics". For example:
>
> --
> Otherwise, return an implementation-defined or user-specified default 
> character encoding, with the confidence tentative. UTF-8 is recommended 
> as a default encoding in most cases. Due to its use in legacy content, 
> windows-1252 is also recommended as a default. Since these encodings can 
> usually be distinguished by inspection, a user agent may heuristically 
> decide which to use as a default.
> --

I've reversed the order, though not removed the mention of the Western 
demographic, which I think is actually quite accurate and generally more 
understandable than, say, occidental. I would like to know what the more 
common codecs are in oriental demographics, though, to broaden the use of 
the recommendations.


> 3. Possibly something should be said (elsewhere, not in this paragraph) 
> about using other "superset" encodings in preference to the explicitly 
> named encoding (that is, other encodings bear the same relationship as 
> windows-1252 does to iso8859-1 and user-agents actually use these 
> encodings to interpret pages and/or encode data in forms, etc.)

Is the current (new) text sufficient in this regard? See also the earlier 
comments for details on the decisions behind the new text.


On Thu, 6 Dec 2007, Sam Ruby wrote:
> Ian Hickson wrote:
> > On Wed, 5 Dec 2007, Sam Ruby wrote:
> > > Henri Sivonen wrote:
> > > > I identified four classes of errors:
> > > >  1) meta charset in XHTML
> > > Why specifying a charset that matches the encoding is flagged as an 
> > > error is probably something that should be discussed another day.  
> > > I happen to believe that people will author content intended to be 
> > > used by multiple user agents which are at various levels of spec 
> > > conformance.
> > 
> > That's actually an XML issue -- XML says the encoding should be in the 
> > XML declaration, so HTML tries to not step on its toes and says that 
> > the charset declaration shouldn't be included in the markup. (The spec 
> > has to say that the UA must ignore that line anyway, so it's not clear 
> > that there's any benefit to including it.)
>
> If the declaration clashed, I could see the value in an error message, 
> but as I said, this can be discussed another day.

Is it another day yet? :-)


On Fri, 25 Jan 2008, Frank Ellermann wrote:
>
> Hi, the chapter about "acceptable" charsets (8.2.2.2) is messy. Clearly 
> UTF-8 and windows-1252 are popular, and you have that.
>
> What you need as a "minimum" for new browsers is UTF-8, US-ASCII (as 
> popular proper subset of UTF-8), ISO-8859-1 (as HTML legacy), and 
> windows-1252 for the reasons stated in the draft, supporting Latin-1 but 
> not windows-1252 would be stupid.

Right, that's what the draft currently requires.


> BTW, I'm not aware that windows-1252 is a violation of CHARMOD, I asked 
> a question about it and C049 in a Last Call of CHARMOD.

See one of the earlier e-mails in this compound reply for the reasoning.


> Please s/but may support more/but should support more/ - the minimum is 
> only that, the minimum.

"SHOULD" has very strong connotations that I do not think apply here. In 
particular, it makes no sense to have an open-ended SHOULD in this 
context.


> | User agents must not support the CESU-8, UTF-7, BOCU-1 and SCSU
> | encodings
>
> I can see a MUST NOT for UTF-7 and CESU-8.  And IMO the only good excuse 
> for legacy charsets is backwards compatibility.  But that is at worst a 
> "SHOULD NOT" for BOCU-1, as you have it for UTF-32.
>
> I refuse to discuss SCSU, but MUST NOT is rather harsh, isn't it ?

As noted earlier, these requirements are derived from real or potential 
security vulnerabilities.


> In 3.7.5.4 you say:
>
> | Authors should not use JIS_X0212-1990, x-JIS0208, and encodings
> | based on EBCDIC.  Authors should not use UTF-32.
>
> What's the logic behind these recommendations ?  Of course EBCDIC
> is rare (as far as HTML is concerned I've never seen it), but it's
> AFAIK not worse than codepage 437, 850, 858, or similar charsets.

Those are non-US-ASCII-compatible encodings. For further reasoning see the 
thread that resulted in:

   http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2007-June/011949.html


> And UTF-32 is relatively harmless, not much worse than UTF-16, it 
> belongs to the charsets recommended in CHARMOD.  Depending on what 
> happens in future Unicode versions banning UTF-32 could backfire.

Actually UTF-32 is quite harmful, due to its extra cost in implementation, 
its very limited testing, and the resulting bugs in almost all known 
implementations.


> There are lots of other charsets starting with UTF-1 that could be 
> listed as SHOULD NOT or even MUST NOT.  Whatever you pick, state what 
> your reasons are, not only the (apparently) arbitrary result.

The reasons are sometimes rather involved or subtle, and I'd rather not 
have the specification defend itself. It's a spec, not a position paper. :-)


> Please make sure that all *unregistered* charsets are SHOULD NOT. Yes, I 
> know the consequences for some proprietary charsets, they are free to 
> register them or to be ignored (CHARMOD C022).

It's already a must ("The value must be a valid character encoding name, 
and must be the preferred name for that encoding.").


On Tue, 29 Jan 2008, Brian Smith wrote:
> Henri Sivonen wrote:
> > My understanding is that HTML 5 bans these post-UTF-8 
> > second-system Unicode encodings no matter where you might 
> > declare the use.
>
> It is in section 3.7.5 (the META element), and not in section 8 (The 
> HTML Syntax), and the reference to section 3.7.5 in section 8 says that 
> the restrictions apply (only) in a (<META>) character encoding 
> declaration. So, it seems the real issue is just clarifying the text in 
> 3.7.5.4 to indicate that those restrictions apply only when the META 
> charset override mechanism is being used.

I don't understand.


> > The purpose of the HTML 5 spec is to improve interoperability between 
> > Web browsers as used with content and Web apps published on the one 
> > public Web. The normative language in the spec is concerned with 
> > publishing and consuming content and apps on the Web. The purpose of 
> > the spec isn't to lower the R&D cost of private and proprietary 
> > systems by producing reusable bits.
>
> Then why doesn't the specification list the encodings that conformant 
> web browsers are required to support, instead of listing the encodings 
> that document authors are forbidden from using.

Because the former list is open-ended, whereas the latter list is not, 
and the latter list is more important.


> > > Even after Unicode and the UTF encodings, new encodings are still 
> > > being created.
> > 
> > Deploying such encodings on the public network is a colossally bad 
> > idea. (My own nation has engaged in this folly with ISO-8859-15, so 
> > I've seen the bad consequences at home, too.)
>
> That is exactly my point. If the intention is that BOCU-1 should be 
> prohibited, then shouldn't ISO-8859-15 be prohibited for the same 
> reason? Why one and not the other?

One is used. The other is not. It really is that simple. We can stop the 
madness for one of them, but it's too late for the other.


> Anyway, I am pretty sure that the restriction against BOCU and similar 
> encodings is just to make it possible to correctly parse the <META> 
> charset override, not to prevent their use altogether. The language just 
> needs to be made clearer.

As the spec says, "authors must not use the CESU-8, UTF-7, BOCU-1 and SCSU 
encodings". There's no limitation to <meta> or anything. They are just 
banned outright.
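
As a rough illustration of how a conformance checker might apply that 
sentence, here is a minimal sketch. The function name and the idea of 
matching on the raw label are assumptions of this example, not something 
the spec defines; in particular, aliases are not resolved.

```python
# The four encodings named in the quoted spec sentence.
BANNED_ENCODINGS = frozenset({"cesu-8", "utf-7", "bocu-1", "scsu"})


def is_banned_encoding(label: str) -> bool:
    """Return True if `label` names one of the four banned encodings.

    Hypothetical helper: the label is trimmed and lower-cased before
    comparison. Alias resolution is deliberately left out of this sketch.
    """
    return label.strip().lower() in BANNED_ENCODINGS
```

Note that the check does not care where the label came from (HTTP header, 
<meta>, or elsewhere), which matches the "banned outright" reading.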


On Thu, 31 Jan 2008, Henri Sivonen wrote:
>
> I ran an analysis on recent error messages from Validator.nu. 
> http://hsivonen.iki.fi/test/moz/analysis.txt

Looking at this from the point of view of encodings, I see the following 
common errors:

 * <meta charset> not being at the top of <head>
 * missing explicit character encoding declaration
 * <meta content=""> not starting with text/html
 * unpreferred encoding names

I think all of these are real errors, and I don't think we should change 
the spec's encoding rules based on this data.

Thanks for this data. Basing spec development on real data like this is of 
huge value.


On Thu, 31 Jan 2008, Sam Ruby wrote:
> >
> > I think we should allow the old internal encoding declaration syntax 
> > for text/html as an alternative to the more elegant syntax. Not 
> > declaring the encoding is bad, so we shouldn't send a negative message 
> > to the authors who are declaring the encoding. Moreover, this is 
> > interoperable stuff.
> > 
> > I think we shouldn't allow this for application/xhtml+xml, though, 
> > because authors might think it has an effect.
>
> By that reasoning, a meta charset encoding declaration should not be 
> allowed if a charset is specified on the Content-Type HTTP header.  I 
> ran into that very problem today:
>
> http://lists.planetplanet.org/archives/devel/2008-January/001747.html
>
> This content was XHTML, but was served as text/html, with a charset 
> specified on the HTTP header, which overrode the charset on the meta 
> declaration.

If they don't match, then there's an error (necessarily so, since one of 
the two encodings has to be wrong!).


> Serving XHTML as text/html, with BOTH a charset specified on the HTTP 
> header AND a meta charset specified just in case is more common than you 
> might think.

It's not a recommended behaviour, though. Just pick one and use it. The 
practice of making documents schizophrenic like this is a side-effect of 
the market not fully supporting XHTML (i.e. IE). If it wasn't for that, 
people wouldn't be as determined to give their documents identity crises.


> A much more useful restriction -- spanning both the HTML5 and XHTML5 
> serializations -- would be to issue an error if multiple sources for 
> encoding information were explicitly specified and if they differ.

That's already required.


On Mon, 11 Feb 2008, Henri Sivonen wrote:
> >
> > A much more useful restriction -- spanning both the HTML5 and XHTML5 
> > serializations -- would be to issue an error if multiple sources for 
> > encoding information were explicitly specified and if they differ.
>
> I agree. I had already implemented this as a warning on the XML side. 
> (Not as an error because I'm not aware of any spec that I could justify 
> for calling it an error.)

If the declarations disagree, one of them is wrong. It's an error for the 
declaration to be wrong.
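
A minimal sketch of such a check, assuming the checker has the HTTP 
Content-Type header value and the <meta> content attribute value as 
strings. The helper names are hypothetical, and the comparison is on the 
lower-cased labels only, so two aliases of the same encoding would be 
reported as a conflict by this sketch.

```python
import re
from typing import Optional

# Loose pattern for a charset parameter; an assumption of this sketch,
# not the spec's extraction algorithm.
_CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([^\s;\"'>]+)", re.IGNORECASE)


def charset_from_content_type(value: str) -> Optional[str]:
    """Return the lower-cased charset label in `value`, or None if absent."""
    match = _CHARSET_RE.search(value)
    return match.group(1).lower() if match else None


def declarations_conflict(http_content_type: str, meta_content: str) -> bool:
    """True only when both sources declare a charset and the labels differ."""
    http_charset = charset_from_content_type(http_content_type)
    meta_charset = charset_from_content_type(meta_content)
    if http_charset is None or meta_charset is None:
        return False  # nothing to compare, so no disagreement to report
    return http_charset != meta_charset
```

When only one source declares an encoding there is nothing to disagree 
with, so the sketch reports no conflict in that case.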


> While I was at it, I noticed that the spec (as well as Gecko) don't 
> require http-equiv='content-type' when looking for a content attribute 
> that looks like an internal encoding declaration. Therefore, I also 
> added a warning that fires if the value of a content attribute would be 
> sniffed as an internal character encoding declaration but a 
> http-equiv='content-type' is missing.

It's an error according to the spec.


On Fri, 1 Feb 2008, Henri Sivonen wrote:
>
> But surely the value for content should be ASCII-case-insensitive.

Ok.


> Also, why limit the space to one U+0020 instead of zero or more space 
> characters?

Ok, I now allow any number of space characters (and any of the space 
characters).
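
To make the two changes above concrete, here is a minimal sketch of a 
matcher for the content attribute value. It assumes the value has the 
shape "text/html;", then zero or more space characters, then "charset=", 
then the encoding name, and it assumes the HTML space characters are 
space, tab, LF, FF and CR. The function name is hypothetical.

```python
import re
from typing import Optional

# re.ASCII keeps re.IGNORECASE from applying Unicode case folding, so the
# literals are matched ASCII-case-insensitively.
_META_CONTENT_RE = re.compile(
    r"text/html;[ \t\n\f\r]*charset=([^ \t\n\f\r]+)",
    re.IGNORECASE | re.ASCII,
)


def encoding_from_meta_content(content: str) -> Optional[str]:
    """Return the encoding name if `content` has the assumed shape, else None."""
    match = _META_CONTENT_RE.fullmatch(content)
    return match.group(1) if match else None
```

The encoding name is returned as written; checking that it is a valid and 
preferred name is a separate step that this sketch does not attempt.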

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Friday, 29 February 2008 01:21:43 UTC