- From: Mark Davis <mark.davis@icu-project.org>
- Date: Thu, 28 Feb 2008 17:42:58 -0800
- To: "Ian Hickson" <ian@hixie.ch>
- Cc: public-i18n-core@w3.org
On Thu, Feb 28, 2008 at 5:21 PM, Ian Hickson <ian@hixie.ch> wrote:

> Executive summary: I made a number of changes, as described below, in response to the feedback on character encodings in HTML. They are covered by revisions 1263 to 1275 of the spec source.
>
> I have cc'ed most (though not all) of the mailing lists that were originally cc'ed on the messages to which I reply below, to keep everyone in the loop. Please, for everyone's sake, pick a single mailing list when replying, and trim the quotes to just the bits to which you are replying. Don't include the whole of this e-mail in your reply! Thanks.
>
> On Sun, 5 Nov 2006, Øistein E. Andersen wrote, in reply to Henri:
> >
> > > I think conforming text/html documents should not be allowed to parse into a DOM that contains characters that are not allowed in XML 1.0. [...] I am inclined to prefer [...] U+FFFD
>
> (I've made the characters not allowed in XML also not allowed in HTML, with the exception of some of the space characters which we need to have allowed for legacy reasons.)
>
> > I perfectly agree. (Actually, I think that U+7F (delete) and the C1 control characters should be excluded [transformed into U+FFFD] as well, but this could perhaps be problematic due to spurious CP1252 characters.)
>
> I've made them illegal but not converted them to FFFD.
>
> On Mon, 6 Nov 2006, Lachlan Hunt wrote:
> >
> > At the very least, ISO-8859-1 must be treated as Windows-1252. I'm not sure about the other ISO-8859 encodings. Numeric and hex character references from 128 to 159 must also be treated as Windows-1252 code points.
>
> All already specified.
>
> On Sun, 5 Nov 2006, Elliotte Harold wrote:
> >
> > The specific problem is that an author may publish a correctly labeled UTF-8 or ISO-8859-8 document or some such. However, the server sends a Content-type header that requires the parser to treat the document as ISO-8859-1 or US-ASCII or something else.
> >
> > The need is for server administrators to allow content authors to specify content types and character sets for the documents they write. The content doesn't need to change. The authors just need the ability to specify the server headers for their documents.
>
> Well, we can't change the way this works from this side, so it's not really our problem at this point.
>
> On Sat, 23 Dec 2006, Henri Sivonen wrote:
> >
> > http://www.elementary-group-standards.com/web-standards/html5-http-equiv-difference.html
> >
> > In short, some authors want to use <meta http-equiv="imagetoolbar" content="no"> but (X)HTML5 doesn't allow it.
> >
> > Personally, I think that authors who want to disable *User* Agent features like that are misguided.
> >
> > Anyway, I thought I'd mention this so that the issue gets informed as opposed to accidental treatment.
>
> Proprietary extensions to HTML are just that, proprietary extensions, and are therefore intentionally not conforming.
>
> On Mon, 26 Feb 2007, Lachlan Hunt wrote:
> >
> > Given that the spec now says that ISO-8859-1 must be treated as Windows-1252, should it still be considered an error to use the C1 control characters (U+0080 to U+009F) if ISO-8859-1 is declared?
> >
> > Some relevant messages from IRC:
> >
> > [15:59] <Lachy> since the spec says if ISO-8859-1 is declared, Windows-1252 must be used. Is it still an error for authors to use the C1 control characters in the range 128-159?
> > [16:23] <Hixie> Lachy: not sure what we should do, there's a bunch of corner cases there. like, should we allow control chars anyway, should we allow ISO-8859-1 to be declared but Win1252 to be used, etc.
> > [16:23] <Hixie> Lachy: can you mail the list with suggestions and a list of the cases you can think of that we should cover?
> > [16:27] <Lachy> I'm having a hard time deciding if it should be allowed or not
> > [16:28] <Lachy> Technically, it is an error and I think users should be notified, but it's practically harmless these days and very common.
> > [16:30] <Lachy> Yet, doing the same thing in XML doesn't work, since XML parsers do treat them as control characters
>
> I've made it be a parse error. I'm sure implementing this is going to be very exciting for Henri.
>
> On Thu, 1 Mar 2007, Henri Sivonen wrote:
> >
> > I think that encoding information should be included in the HTTP payload. In my opinion, the spec should not advise against this. Preferably, it would encourage putting the encoding information in the payload. (The BOM or, in the case of XML, the UTF-8 defaulting of the XML sniffing algorithm are fine.)
>
> I can't seem to find the part of the spec that recommends the opposite of this... did I already remove it? I'm happy to make the spec silent on this point, since experts disagree.
>
> On Sun, 11 Mar 2007, Geoffrey Sneddon wrote:
> >
> > From implementing parts of the input stream (section 8.2.2 as of writing) yesterday, I found several issues (some of which will show the asshole[1] within me):
> >
> > - Within step one of the get an attribute sub-algorithm it says "start over" – is this starting over the sub-algorithm or the whole algorithm?
>
> Fixed.
>
> > - Again in step one, why do we need to skip whitespace in both the sub-algorithm and at section one of the inner step for <meta> tags?
>
> Otherwise, the <meta bit would be pointing at the "<" and would treat "meta" as an attribute name.
>
> > - In step 11, when we have anything apart from a double/single quote or less/greater than sign, we add it to the value, but don't move the position forward, so when we move onto step 12 we add it again.
>
> Yes, valid point. Fixed.
>
> > - In step 3 of the very inner set of steps for a content attribute in a <meta> tag, is charset case-sensitive?
>
> Doesn't matter, the parser lowercases everything anyway.
>
> > - Again there, shouldn't we be given unicode codepoints for that (as it'll be a unicode string)?
>
> Not sure what you mean.
>
> On Sat, 26 May 2007, Henri Sivonen wrote:
> >
> > The draft says: "A leading U+FEFF BYTE ORDER MARK (BOM) must be dropped if present."
> >
> > That's reasonable for UTF-8 when the encoding has been established by other means.
> >
> > However, when the encoding is UTF-16LE or UTF-16BE (i.e. supposed to be signatureless), do we really want to drop the BOM silently? Shouldn't it count as a character that is in error?
>
> Do the UTF-16LE and UTF-16BE specs make a leading BOM an error? If yes, then we don't have to say anything, it's already an error. If not, what's the advantage of complaining about the BOM in this case?
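To keep the three BOM cases straight, here is a minimal sketch (not the spec's normative table in 8.2.2.1) of the signature check under discussion: a leading signature selects the encoding and is stripped, whereas under the nominally signatureless UTF-16LE/UTF-16BE labels an initial U+FEFF is not a BOM at all but a ZERO WIDTH NO-BREAK SPACE, which is why the question of flagging it arises.

    # A minimal sketch of BOM signature sniffing (illustrative only).
    # Returns the detected encoding and the signature length to strip,
    # or (None, 0) if no signature is present.

    def sniff_bom(head: bytes):
        if head.startswith(b"\xef\xbb\xbf"):
            return "utf-8", 3
        if head.startswith(b"\xfe\xff"):
            return "utf-16be", 2
        if head.startswith(b"\xff\xfe"):
            return "utf-16le", 2
        return None, 0

    # Usage: a signature takes precedence over any later <meta> declaration.
    print(sniff_bom(b"\xef\xbb\xbf<!DOCTYPE html>"))  # ('utf-8', 3)
    print(sniff_bom(b"<!DOCTYPE html>"))              # (None, 0)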
> > Likewise, if an encoding signature BOM has been discarded and the first logical character of the stream is another BOM, shouldn't that also count as a character that is in error?
> >
> > I think I should elaborate that when the encoding is UTF-16 (not UTF-16LE or UTF-16BE), the BOM gets swallowed by the character decoding layer (in reasonable decoder implementations) and is not returned from the character stream at all. Therefore, on the character level, a droppable BOM only occurs in UTF-8 when the encoding was established by other means.
>
> The spec says: "Given an encoding, the bytes in the input stream must be converted to Unicode characters for the tokeniser, as described by the rules for that encoding, except that leading U+FEFF BYTE ORDER MARK characters must not be stripped by the encoding layer."
>
> On Mon, 28 May 2007, Henri Sivonen wrote:
> >
> > To this end, I think at least for conforming documents the algorithm for establishing the character encoding should be deterministic. I'd like to request two things:
> >
> > 1) When sniffing for meta charset, the current draft allows a user agent to give up sooner than after examining the first 512 bytes. To make meta charset sniffing reliable and deterministic so that it doesn't depend on flukes in buffering, I think UAs should (if there's no transfer protocol level charset label and no BOM) be required to consume bytes until they find a meta charset, reach the EOF or have examined 512 bytes. That is, I think UAs should not be allowed to give up earlier. (On the other hand, I think UAs should be allowed to start examining the byte stream before 512 bytes have been buffered without an IO error, since in general, byte stream buffer management should be up to the IO libraries and outside the scope of the HTML spec.)
>
> I don't want to do this because I don't want to require that browsers handle a CGI script that outputs 500 bytes then hangs for a minute in a way that doesn't render anything for a minute, and I don't want to require that people writing such CGI scripts front-load a 512 byte comment.
>
> We've already conceded that a page can document.write() an encoding declaration after 6 megabytes of content and end up causing a reparse.
>
> > 2) Since the chardet step is optional and the spec doesn't make the Mozilla chardet behavior normative, I think the document should be considered non-conforming if the algorithm for establishing the character encoding proceeds to steps 6 (chardet) or 7 (last resort default).
>
> That would make most of my pages non-conforming. It would make this non-conforming:
>
>   <!DOCTYPE HTML>
>   <html>
>    <head>
>     <title> Example </title>
>    </head>
>    <body>
>     <p> I don't want to be non-conforming! </p>
>    </body>
>   </html>
>
> > It wouldn't hurt, though, to say in the section on writing documents that at least one of the following is required for document conformance:
> > * A transfer protocol-level character encoding declaration.
> > * A meta charset within the first 512 bytes.
> > * A BOM.
>
> We already require that, though without the 512 byte requirement.
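To make the buffering argument concrete, here is a toy version of the bounded prescan being debated. The spec's real algorithm in 8.2.2 walks the bytes one at a time and understands comments, quoting, and http-equiv; this sketch merely pattern-matches, and exists only to show how the cut-off makes the outcome depend on where the declaration falls in the stream.

    import re

    # Toy illustration only: a declaration beyond the examined window is
    # simply not seen, which is the determinism problem Henri describes.

    META_CHARSET = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([a-zA-Z0-9:_\-]+)')

    def prescan(head: bytes, window: int = 512) -> str | None:
        m = META_CHARSET.search(head[:window])
        return m.group(1).decode("ascii").lower() if m else None

    print(prescan(b'<!DOCTYPE html><meta charset=utf-8>'))  # utf-8
    print(prescan(b' ' * 512 + b'<meta charset=utf-8>'))    # None: past the window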
> On Tue, 29 May 2007, Henri Sivonen wrote:
> >
> > To avoid stepping on the toes of Charmod more than is necessary, I suggest making it non-conforming for a document to have bytes in the 0x80…0x9F range when the character encoding is declared to be one of the ISO-8859 family encodings.
>
> Done, I believe.
>
> > (UA conformance requires in some cases these bytes to be decoded in a Charmod-violating way, but reality trumps Charmod for UA conformance. While I'm at it: Surely there are other ISO-8859 family encodings besides ISO-8859-1 that require decoding using the corresponding windows-* family decoder?)
>
> Maybe; anyone have any concrete information?
>
> On Tue, 29 May 2007, Maciej Stachowiak wrote:
> >
> > I don't know of any ISO-8859 encodings requiring this, but for all unicode encodings and numeric entity references compatibility requires interpreting this range of code points in the WinLatin1 way.
>
> On Mon, 4 Jun 2007, Henri Sivonen wrote:
> >
> > I tested with Firefox 2.0.4, Minefield, Safari 2.0.4, WebKit nightly and Opera 9.20 (all on Mac). Only Safari 2.0.4 gives the DWIM treatment to the C1 code point range in UTF-8 and UTF-16.
> >
> > This makes me suspect that compatibility with the Web doesn't really require the DWIM treatment here. What does IE7 do?
> >
> > The data I used: http://hsivonen.iki.fi/test/utf-c1/
>
> IE7 and Safari 3 do the same as the other browsers, namely, no DWIM treatment. So, I haven't changed the spec.
>
> On Fri, 1 Jun 2007, Henri Sivonen wrote:
> >
> > The anomalies seem to be:
> > 1) ISO-8859-1 is decoded as Windows-1252.
> > 2) 0x85 in ISO-8859-10 and in ISO-8859-16 is decoded as in Windows-1252 (ellipsis) by Gecko.
> > 3) ISO-8859-11 is decoded as Windows-874.
> >
> > I was rather surprised by the results. They weren't at all what I expected. Test data: http://hsivonen.iki.fi/test/iso8859/
> >
> > I suggest adding the ISO-8859-11 to Windows-874 mapping to the spec.
>
> On Fri, 1 Jun 2007, Henri Sivonen wrote:
> >
> > By Firefox and Opera. Safari doesn't support ISO-8859-11 and I was unable to test IE.
>
> On Fri, 1 Jun 2007, Simon Pieters wrote:
> >
> > IE7 and Opera handle ISO-8859-11.htm the same, AFAICT.
>
> I did some studies and there appear to be enough pages labelled as ISO-8859-11 to add this. I didn't check how many had bytes in the affected range, which would maybe be worth checking, though.
>
> On Sat, 2 Jun 2007, Øistein E. Andersen wrote:
> >
> > As suggested earlier [1], a simpler solution seems to be to treat C1 bytes and NCRs from /all/ ISO-8859-* and Unicode encodings as Windows-1252.
>
> That seems excessive.
>
> On Tue, 5 Jun 2007, Henri Sivonen wrote:
> >
> > > To avoid stepping on the toes of Charmod more than is necessary, I suggest making it non-conforming for a document to have bytes in the 0x80…0x9F range when the character encoding is declared to be one of the ISO-8859 family encodings.
> >
> > I've been thinking about this. I have a proposal on how to spec this *conceptually* and how to implement this with error reporting. I am assuming here that 1) no one ever intends C1 code points to be present in the decoded stream and 2) we want, as a Charmod correctness fig leaf, to make the C1 bytes non-conforming when ISO-8859-1 or ISO-8859-11 was declared but Windows-1252 or Windows-874 decoding is needed.
>
> I really don't care too much about the fig leaf part.
>
> > Based on the behavior of Minefield and Opera 9.20, the following seems to be the least Charmod violating and least quirky approach that could possibly work:
> >
> > 1) Decode the byte stream using a decoder for whatever encoding was declared, even ISO-8859-1 or ISO-8859-11, according to ftp://ftp.unicode.org/Public/MAPPINGS/.
> > 2) If a character in the decoded character stream is in the C1 code point range, this is a document conformance violation.
> > 2a) If the declared encoding was ISO-8859-1, replace that character with the character that you get by casting the code point into a byte and decoding it as Windows-1252.
> > 2b) If the declared encoding was ISO-8859-11, replace that character with the character that you get by casting the code point into a byte and decoding it as Windows-874.
>
> That sounds far more complex than what we have now.
>
> On Tue, 5 Jun 2007, Kristof Zelechovski wrote:
> >
> > 2c) If the declared encoding was ISO-8859-2, replace that character with the character that you get by casting the code point into a byte and decoding it as Windows-1250.
>
> On Tue, 5 Jun 2007, Henri Sivonen wrote:
> >
> > As far as I can tell, that's not what Firefox, Minefield, Opera 9.20 and WebKit nightlies do, so apparently it is not required for compatibility with a notable number of pages.
>
> Indeed.
>
> On Tue, 5 Jun 2007, Maciej Stachowiak wrote:
> >
> > What we actually do in WebKit is always use a windows-1252 decoder when ISO-8859-1 is requested. I don't think it's very helpful to make all documents that declare an ISO-8859-1 encoding and use characters in the C1 range nonconforming. It's true that they are counting on nonstandard processing of the nominally declared encoding, but I don't think that causes a problem in practice, as long as the rule is well known. It seems simpler to just make latin1 an alias for winlatin1.
>
> I agree.
>
> On Fri, 1 Jun 2007, Raphael Champeimont (Almacha) wrote:
> >
> > I think there is something wrong in the "get an attribute" algorithm from 8.2.2. The input stream.
> >
> > Between steps 11 and 12 I think there is a missing:
> >
> > 11b: Advance position to the next byte.
> >
> > With the current algorithm, if I write <meta charset = ascii> it will say the value of attribute charset is "aascii", with one extra leading "a".
> >
> > The reason is that in step 11, if we fall in case "Anything else", we add the new char to the string, and then if we fall in "Anything else" in step 12 we add the *same* char to the string again, so the first char of the attribute value appears 2 times.
>
> Fixed. (Though please check. I made several changes to this algorithm and would be happier if I knew someone had proofread the changes!)
>
> On Fri, 1 Jun 2007, Henri Sivonen wrote:
> >
> > In the charset meta sniffing algorithm under "Attribute name:":
> >
> > > If it is 0x2F (ASCII '/'), 0x3C (ASCII '<'), or 0x3E (ASCII '>')
> > > Stop looking for an attribute. The attribute's name is the value of attribute name, its value is the empty string.
> >
> > In general, it seems to me the algorithm isn't quite clear on when to stop looking for the current attribute and when to stop looking for attributes for the current tag altogether.
>
> The spec never distinguishes these two cases in the "get an attribute" algorithm -- the algorithm that invokes the "get an attribute" algorithm is the one that decides how often it is done.
>
> > In this step, it seems to me that '/' should advance the pointer and end getting the current attribute, followed by getting another attribute. '>' should end getting attributes on the whole tag without changing the pointer.
>
> It doesn't matter. Both return an attribute, then the invoking algorithm retries, and if that results in no attribute (because you're on the ">") then you stop looking for the tag.
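Since Ian asks for the revised algorithm to be proofread, here is a deliberately simplified, executable rendering of the "get an attribute" sub-algorithm with the fix applied. The function name and structure are mine, not the spec's, and EOF handling is minimal; the decisive line is the position advance after the first byte of an unquoted value is consumed, i.e. the missing "11b" that produced "aascii".

    # Simplified sketch of the prescan's "get an attribute" sub-algorithm.

    WS = b" \t\n\r\x0c"

    def get_attribute(data: bytes, pos: int):
        """Return ((name, value), new_pos), or (None, new_pos) if there is
        no further attribute before the '>' that ends the tag."""
        while pos < len(data) and data[pos:pos+1] in WS + b"/":
            pos += 1                          # skip whitespace and '/'
        if pos >= len(data) or data[pos:pos+1] == b">":
            return None, pos                  # caller stops scanning the tag
        name, value = bytearray(), bytearray()
        while pos < len(data) and data[pos:pos+1] not in b"=>/" + WS:
            name += data[pos:pos+1].lower()
            pos += 1
        while pos < len(data) and data[pos:pos+1] in WS:
            pos += 1                          # whitespace before '='
        if pos >= len(data) or data[pos:pos+1] != b"=":
            return (bytes(name), b""), pos    # attribute without a value
        pos += 1                              # consume '='
        while pos < len(data) and data[pos:pos+1] in WS:
            pos += 1                          # whitespace after '='
        if pos < len(data) and data[pos:pos+1] in b"\"'":
            quote = data[pos:pos+1]
            pos += 1
            while pos < len(data) and data[pos:pos+1] != quote:
                value += data[pos:pos+1].lower()
                pos += 1
            return (bytes(name), bytes(value)), pos + 1
        # Unquoted value: consume the first byte...
        if pos < len(data) and data[pos:pos+1] != b">":
            value += data[pos:pos+1].lower()
            pos += 1    # ...and advance: the missing "step 11b" from above
        while pos < len(data) and data[pos:pos+1] not in b">" + WS:
            value += data[pos:pos+1].lower()
            pos += 1
        return (bytes(name), bytes(value)), pos

    # Without the marked advance, <meta charset = ascii> yields "aascii".
    print(get_attribute(b"<meta charset = ascii>", 5))
    # -> ((b'charset', b'ascii'), 21)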
> On Fri, 1 Jun 2007, Henri Sivonen wrote:
> >
> > The spec probably needs to be made more specific about the case where the ASCII byte-based algorithm finds a supported encoding name but the encoding is not a rough ASCII superset.
> >
> > 23:46 < othermaciej> one quirk in Safari is that if there's a meta tag claiming the source is utf-16, we treat it as utf-8
> > ...
> > 23:48 < othermaciej> hsivonen: there is content that needs it
> > ...
> > 23:52 < othermaciej> hsivonen: I think we may treat any claimed unicode charset in a <meta> tag as utf-8
>
> Oops, I had this for the case where utf-16 was detected on the fly, but not for the preparser. Fixed.
>
> On Sat, 2 Jun 2007, Philip Taylor wrote:
> >
> > 8.2.2. The input stream: "If the next six characters are not 'charset'" - s/six/seven/
>
> Fixed.
>
> On Thu, 14 Jun 2007, Henri Sivonen wrote:
> >
> > As written, the charset sniffing algorithm doesn't trim space characters from around the tentative encoding name. html5lib test cases expect the space characters to be trimmed.
> >
> > I suggest trimming space characters (or anything <= 0x20, depending on which approach is right for compat).
>
> Actually it seems browsers don't do any trimming here. I've added a comment to that effect.
>
> On Sat, 23 Jun 2007, Øistein E. Andersen wrote:
> >>
> >>> Bytes or sequences of bytes in the original byte stream that could not be converted to Unicode characters must be converted to U+FFFD REPLACEMENT CHARACTER code points.
> >>
> >> [This does not specify the exact number of replacement characters.]
> >
> > > I don't really know how to define this.
> >
> > Unicode 5.0 remains vague on this point. (E.g., definition D92 defines well-formed and ill-formed UTF-8 byte sequences, but conformance requirement C10 only requires ill-formed sequences to be treated as an error condition and suggests that a one-byte ill-formed sequence may be either filtered out or replaced by a U+FFFD replacement character.) More generally, character encoding specifications can hardly be expected to define proper error handling, since they are usually not terribly preoccupied with mislabelled data.
>
> They should define error handling, and are defective if they don't. However, I agree that many specs are defective. This is certainly not limited to character encoding specifications.
>
> > The current text may nevertheless be too liberal. It would notably be possible to construct an arbitrarily long Chinese text in a legacy encoding which -- according to the spec -- could be replaced by one single U+FFFD replacement character if incorrectly handled as UTF-8, which might lead the user to think that the page is completely uninteresting and therefore move on, whereas a larger number of replacement characters would have led him to try another encoding. (This is only a problem, of course, if an implementor chooses to emit the minimal number of replacement characters sanctioned by the spec.)
>
> Yes, but this is a user interface issue, not an interoperability issue, so I don't think we need to be concerned about it.
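For what it is worth, the convention most decoders eventually settled on (and that the later WHATWG Encoding Standard codified) is one U+FFFD per maximal ill-formed subsequence, which answers the Chinese-text worry: every damaged sequence leaves a visible scar. A quick illustration, assuming Python's UTF-8 decoder, which follows that convention:

    # One U+FFFD per maximal ill-formed subsequence: the three-byte CJK
    # character decodes normally, each stray 0xFF becomes its own
    # replacement character, and the truncated two-byte prefix at the
    # end becomes one more.

    data = b"\xe4\xbd\xa0" + b"\xff\xff" + b"\xe4\xbd"
    text = data.decode("utf-8", errors="replace")
    print(text.count("\ufffd"))   # 3 -- not 1, and not 4 either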
> On Thu, 2 Aug 2007, Henri Sivonen wrote:
> > On Aug 2, 2007, at 10:11, Ian Hickson wrote:
> > >
> > > Would a non-normative note help here? Something like:
> > >
> > > Note: Bytes or sequences of bytes in the original byte stream that did not conform to the encoding specification (e.g. invalid UTF-8 byte sequences in a UTF-8 input stream) are errors that conformance checkers are expected to report.
> > >
> > > ...to be put after the paragraph that reads "Bytes or sequences of bytes in the original byte stream that could not be converted to Unicode characters must be converted to U+FFFD REPLACEMENT CHARACTER code points".
> >
> > Yes, this is what I meant with "a note hinting at the consequences".
>
> Ok, added.
>
> > > (Note that not all bytes or sequences of bytes in the original byte stream that could not be converted to Unicode characters are necessarily errors. It could just be that the encoding has a character set that isn't a subset of Unicode, e.g. the Apple logo found in most Apple character sets doesn't have a non-PUA analogue in Unicode. Its presence in an HTML document isn't an error as far as I'm concerned.)
> >
> > Since XML and HTML5 are defined in terms of Unicode characters, there's nowhere to go except error and REPLACEMENT CHARACTER or the PUA for characters that aren't in Unicode. I'd steer clear of this in the spec and let decoders choose between de facto PUA assignments (like U+F8FF for the Apple logo) and errors.
>
> Yeah, I don't have any intention of mentioning this in the spec.
>
> On Wed, 31 Oct 2007, Martin Duerst wrote:
> >
> > [8.2.2.1]
> >
> > In point 3., it's not completely clear whether the encoding returned is e.g. "UTF-16BE BOM" or "UTF-16BE". Probably the best thing editorially is to move the word BOM from the description column of the table to the text prior to the table.
>
> Fixed.
>
> > In point 7, what I find unnecessary is the repeated mention of heuristic algorithms, which are already mentioned previously in point 6.
>
> The heuristics in step 6 are for determining an encoding based on the byte stream, e.g. using frequency analysis. The heuristics in step 7 are for picking a default once that has failed. For example, if the defaults are UTF-8 or Win1252, then you can determine which to pick by simply deciding whether or not the stream is valid UTF-8.
>
> > (I'm really interested what document [UNIVCHADET] is going to point to.)
>
> http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
> (It's in the source.)
>
> > What I find missing/unclear is that the user can override the page encoding manually. What is mentioned is a user-specified default, which makes sense (e.g. "well, I'm mostly viewing Chinese pages, so I set my default to GB2312"). However, what we also need is the possibility for a user to override the encoding of a specific page (not changing the default). This is necessary because some pages are still mislabeled. When such an override is present, it should come before what's currently number 1.
>
> User agents can provide user interfaces to override anything they want, e.g. they could provide an interface that changes all <script> elements into <pre> elements on the fly, or whatever. Such behaviour is outside the scope of the specification, since it is no longer about interoperability, but about user control. It's technically non-compliant, because it is doing something with the page that doesn't match what would happen for other people (unless they _also_ overrode the spec behaviour).
> > In 8.2.2.2, what I find unnecessary is that encodings such as UTF-7 are explicitly forbidden. I agree that these are virtually useless. However, I don't think implementing them would create any harm, and I don't think they should be dignified by even mentioning them.
>
> Sadly, they do cause harm. The ones that are outlawed have all been used in either actual attacks or proof-of-concept attacks described in vulnerability reports, mostly due to their deceptive similarity to more common encodings. (UTF-7 in particular has been used in a number of attacks, because IE supported auto-detecting it, if I recall correctly.)
>
> > In 8.2.2.4, I have no idea what the reason or purpose of point 1 is, which reads "If the new encoding is UTF-16, change it to UTF-8.". I suspect some misunderstanding.
>
> This is required because many pages are labelled as UTF-16 but actually use UTF-8. For example: http://www.zingermans.com
>
> > Well, now let's get back to CharMod, and to the place where I think you need to do more work. HTML5 currently says "treat data labeled iso-8859-1 as windows-1252". This conflicts with C025 of CharMod (http://www.w3.org/TR/charmod/#C025):
> >
> > C025 [I] [C] An IANA-registered charset name MUST NOT be used to label text data in a character encoding other than the one identified in the IANA registration of that name.
> >
> > and also C030 (http://www.w3.org/TR/charmod/#C030):
> >
> > C030 [I] When an IANA-registered charset name is recognized, receiving software MUST interpret the received data according to the encoding associated with the name in the IANA registry.
> >
> > So the following sentence:
> >
> > "When a user agent would otherwise use the ISO-8859-1 encoding, it must instead use the Windows-1252 encoding."
> >
> > from HTML5 is clearly not conforming to CharMod.
>
> Indeed, it says so explicitly in the spec.
>
> > Please note that the above items (C025 and C030) say that they only affect implementations ([I]) and content ([C]), but I think the main reason for this is that we never even imagined that a spec would say "you must treat FOO as BAR".
> >
> > I don't disagree with 'widely deployed', but I think one main reason for this is that it took ages to get windows-1252 registered. I think there are other ways to deal with this issue than a MUST. One thing that I guess you could do is to just describe current practice.
>
> Well, what we're describing is what an implementation has to do to be compatible with the other implementations. And right now, this is one of the things it has to do.
>
> > This brings me to another point: The whole HTML5 spec seems to be written with implementers, and implementers only, in mind. This is great to help get browser behavior aligned, but it creates an enormous problem: The majority of potential users of the spec, namely creators of content, and of tools creating content, are completely left out. As an example, trying to reverse-engineer how to indicate the character encoding inside an HTML5 document from point 4 in 8.2.2.1 is completely impossible for content creators, webmasters, and the like.
>
> Section "8.2 Parsing HTML documents" is indeed exclusively for user agent implementors and conformance checker implementors. For authors and authoring tool implementors, you want section "8.1 Writing HTML documents" and section "3.7.5.4. Specifying the document's character encoding" (which is linked to from 8.1). These give the flipside of these requirements, the authoring side.
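Collecting the decisions above, the label fix-ups this thread actually endorses are few. A hedged sketch (the table lists only the cases discussed here, not the spec's full set; the function name is mine):

    # Only the overrides discussed in this thread; the spec's table is larger.
    LABEL_OVERRIDES = {
        "iso-8859-1":  "windows-1252",  # C1 bytes are WinLatin1 in practice
        "iso-8859-11": "windows-874",   # per the Firefox/Opera findings above
    }

    def effective_encoding(label: str, declared_in_meta: bool = False) -> str:
        enc = label.strip().lower()
        enc = LABEL_OVERRIDES.get(enc, enc)
        # A <meta>-declared UTF-16 cannot be truthful: the prescan only
        # found the declaration because the bytes were ASCII-compatible.
        # Hence 8.2.2.4 step 1, "if the new encoding is UTF-16, change it
        # to UTF-8".
        if declared_in_meta and enc.startswith("utf-16"):
            enc = "utf-8"
        return enc

    print(effective_encoding("ISO-8859-1"))                     # windows-1252
    print(effective_encoding("UTF-16", declared_in_meta=True))  # utf-8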
> On Sat, 3 Nov 2007, Addison Phillips wrote:
> >
> > --
> > Otherwise, return an implementation-defined or user-specified default character encoding, with the confidence tentative. Due to its use in legacy content, windows-1252 is recommended as a default in predominantly Western demographics. In non-legacy environments, the more comprehensive UTF-8 encoding is recommended instead. Since these encodings can in many cases be distinguished by inspection, a user agent may heuristically decide which to use as a default.
> > --
> >
> > Our comment is that this is a pretty weak recommendation. It is difficult to say what a "Western demographic" means in this context. We think we know why this is here: untagged HTML4 documents have a default character encoding of ISO 8859-1, so it is unsurprising to assume its common superset encoding when no other encoding can be guessed.
> >
> > However, we would like to see several things happen here:
> >
> > 1. It never actually says anywhere why windows-1252 must be used instead of ISO 8859-1.
>
> This is required in "Preprocessing the input stream".
>
> > 2. As quoted, it seems to (but does not actually) favor 1252 over UTF-8. Since UTF-8 is highly detectable and also the best long-term general default, we'd prefer if the emphasis were reversed, dropping the reference to "Western demographics". For example:
> >
> > --
> > Otherwise, return an implementation-defined or user-specified default character encoding, with the confidence tentative. UTF-8 is recommended as a default encoding in most cases. Due to its use in legacy content, windows-1252 is also recommended as a default. Since these encodings can usually be distinguished by inspection, a user agent may heuristically decide which to use as a default.
> > --
>
> I've reversed the order, though not removed the mention of the Western demographic, which I think is actually quite accurate and generally more understandable than, say, occidental. I would like to know what the more common codecs are in oriental demographics, though, to broaden the use of the recommendations.
>
> > 3. Possibly something should be said (elsewhere, not in this paragraph) about using other "superset" encodings in preference to the explicitly named encoding (that is, other encodings bear the same relationship as windows-1252 does to iso8859-1, and user agents actually use these encodings to interpret pages and/or encode data in forms, etc.)
>
> Is the current (new) text sufficient in this regard? See also the earlier comments for details on the decisions behind the new text.
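The "distinguished by inspection" remark can be made concrete with a trial decode: non-ASCII text is essentially never valid UTF-8 by accident, so a strict UTF-8 pass cheaply separates the two recommended defaults. A sketch (real detectors, such as Mozilla's universal charset detector referenced above, do far more):

    def guess_last_resort(head: bytes) -> str:
        """Pick between the two recommended defaults by trial decoding.
        Caveat: a multi-byte sequence cut off at the end of `head` would
        wrongly fail the test, so real implementations decode incrementally."""
        try:
            head.decode("utf-8")   # pure ASCII also lands here, harmlessly
            return "utf-8"
        except UnicodeDecodeError:
            return "windows-1252"

    print(guess_last_resort("café".encode("utf-8")))         # utf-8
    print(guess_last_resort("café".encode("windows-1252")))  # windows-1252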
(The spec > > > has to say that the UA must ignore that line anyway, so it's not clear > > > that there's any benefit to including it.) > > > > If the declaration clashed, I could see the value in an error message, > > but as I said, this can be discussed another day. > > Is it another day yet? :-) > > > On Fri, 25 Jan 2008, Frank Ellermann wrote: > > > > Hi, the chapter about "acceptable" charsets (8.2.2.2) is messy. Clearly > > UTF-8 and windows-1252 are popular, and you have that. > > > > What you need as a "minimum" for new browsers is UTF-8, US-ASCII (as > > popular proper subset of UTF-8), ISO-8859-1 (as HTML legacy), and > > windows-1252 for the reasons stated in the draft, supporting Latin-1 but > > not windows-1252 would be stupid. > > Right, that's what the draft current requires. > > > > BTW, I'm not aware that windows-1252 is a violation of CHARMOD, I asked > > a question about it and C049 in a Last Call of CHARMOD. > > See one of the earlier e-mails in this compound reply for the reasoning. > > > > Please s/but may support more/but should support more/ - the minimum is > > only that, the minimum. > > "SHOULD" has very strong connotations that I do not think apply here. In > particular, it makes no sense to have an open-ended SHOULD in this > context. > > > > | User agents must not support the CESU-8, UTF-7, BOCU-1 and SCSU > > | encodings > > > > I can see a MUST NOT for UTF-7 and CESU-8. And IMO the only good excuse > > for legacy charsets is backwards compatibility. But that is at worst a > > "SHOULD NOT" for BOCU-1, as you have it for UTF-32. > > > > I refuse to discuss SCSU, but MUST NOT is rather harsh, isn't it ? > > As noted earlier, these requirements are derived from real or potential > security vulnerabilities. > > > > In 3.7.5.4 you say: > > > > | Authors should not use JIS_X0212-1990, x-JIS0208, and encodings > > | based on EBCDIC. Authors should not use UTF-32. > > > > What's the logic behind these recommendations ? Of course EBCDIC > > is rare (as far as HTML is concerned I've never seen it), but it's > > AFAIK not worse than codepage 437, 850, 858, or similar charsets. > > Those are non-US-ASCII-compatible encodings. For further reasoning see the > thread that resulted in: > > http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2007-June/011949.html > > > > And UTF-32 is relatively harmless, not much worse than UTF-16, it > > belongs to the charsets recommended in CHARMOD. Depending on what > > happens in future Unicode versions banning UTF-32 could backfire. > > Actually UTF-32 is quite harmful, due to its extra cost in implementation, > its very limited testing, and the resulting bugs in almost all known > implementations. > > > > There are lots of other charsets starting with UTF-1 that could be > > listed as SHOULD NOT or even MUST NOT. Whatever you pick, state what > > your reasons are, not only the (apparently) arbitrary result. > > The reasons are sometimes rather involved or subtle, and I'd rather not > have the specification defend itself. It's a spec, not a positon paper. :-) > > > > Please make sure that all *unregistered* charsets are SHOULD NOT. Yes, I > > know the consequences for some proprietary charsets, they are free to > > register them or to be ignored (CHARMOD C022). > > It's already a must ("The value must be a valid character encoding name, > and must be the preferred name for that encoding."). 
> On Tue, 29 Jan 2008, Brian Smith wrote:
> > Henri Sivonen wrote:
> > > My understanding is that HTML 5 bans these post-UTF-8 second-system Unicode encodings no matter where you might declare the use.
> >
> > It is in section 3.7.5 (the META element), and not in section 8 (The HTML Syntax), and the reference to section 3.7.5 in section 8 says that the restrictions apply (only) in a (<META>) character encoding declaration. So, it seems the real issue is just clarifying the text in 3.7.5.4 to indicate that those restrictions apply only when the META charset override mechanism is being used.
>
> I don't understand.
>
> > > The purpose of the HTML 5 spec is to improve interoperability between Web browsers as used with content and Web apps published on the one public Web. The normative language in the spec is concerned with publishing and consuming content and apps on the Web. The purpose of the spec isn't to lower the R&D cost of private and proprietary systems by producing reusable bits.
> >
> > Then why doesn't the specification list the encodings that conformant web browsers are required to support, instead of listing the encodings that document authors are forbidden from using?
>
> Because the former list is open-ended, whereas the latter list is not, and the latter list is more important.
>
> > > > Even after Unicode and the UTF encodings, new encodings are still being created.
> > >
> > > Deploying such encodings on the public network is a colossally bad idea. (My own nation has engaged in this folly with ISO-8859-15, so I've seen the bad consequences at home, too.)
> >
> > That is exactly my point. If the intention is that BOCU-1 should be prohibited, then shouldn't ISO-8859-15 be prohibited for the same reason? Why one and not the other?
>
> One is used. The other is not. It really is that simple. We can stop the madness for one of them, but it's too late for the other.
>
> > Anyway, I am pretty sure that the restriction against BOCU and similar encodings is just to make it possible to correctly parse the <META> charset override, not to prevent their use altogether. The language just needs to be made clearer.
>
> As the spec says, "authors must not use the CESU-8, UTF-7, BOCU-1 and SCSU encodings". There's no limitation to <meta> or anything. They are just banned outright.
>
> On Thu, 31 Jan 2008, Henri Sivonen wrote:
> >
> > I ran an analysis on recent error messages from Validator.nu. http://hsivonen.iki.fi/test/moz/analysis.txt
>
> Looking at this from the point of view of encodings, I see the following common errors:
>
> * <meta charset> not being at the top of <head>
> * missing explicit character encoding declaration
> * <meta content=""> not starting with text/html
> * unpreferred encoding names
>
> I think all of these are real errors, and I don't think we should change the spec's encoding rules based on this data.
>
> Thanks for this data. Basing spec development on real data like this is of huge value.
>
> On Thu, 31 Jan 2008, Sam Ruby wrote:
> > >
> > > I think we should allow the old internal encoding declaration syntax for text/html as an alternative to the more elegant syntax. Not declaring the encoding is bad, so we shouldn't send a negative message to the authors who are declaring the encoding. Moreover, this is interoperable stuff.
> > > I think we shouldn't allow this for application/xhtml+xml, though, because authors might think it has an effect.
> >
> > By that reasoning, a meta charset encoding declaration should not be allowed if a charset is specified on the Content-Type HTTP header. I ran into that very problem today: http://lists.planetplanet.org/archives/devel/2008-January/001747.html
> >
> > This content was XHTML, but was served as text/html, with a charset specified on the HTTP header, which overrode the charset on the meta declaration.
>
> If they don't match, then there's an error (forcibly so, since one of the two encodings has to be wrong!).
>
> > Serving XHTML as text/html, with BOTH a charset specified on the HTTP header AND a meta charset specified just in case, is more common than you might think.
>
> It's not a recommended behaviour, though. Just pick one and use it. The practice of making documents schizophrenic like this is a side-effect of the market not fully supporting XHTML (i.e. IE). If it wasn't for that, people wouldn't be as determined to give their documents identity crises.
>
> > A much more useful restriction -- spanning both the HTML5 and XHTML5 serializations -- would be to issue an error if multiple sources for encoding information were explicitly specified and if they differ.
>
> That's already required.
>
> On Mon, 11 Feb 2008, Henri Sivonen wrote:
> >
> > > A much more useful restriction -- spanning both the HTML5 and XHTML5 serializations -- would be to issue an error if multiple sources for encoding information were explicitly specified and if they differ.
> >
> > I agree. I had already implemented this as a warning on the XML side. (Not as an error, because I'm not aware of any spec that I could cite to justify calling it an error.)
>
> If the declarations disagree, one of them is wrong. It's an error for the declaration to be wrong.
>
> > While I was at it, I noticed that the spec (as well as Gecko) doesn't require http-equiv='content-type' when looking for a content attribute that looks like an internal encoding declaration. Therefore, I also added a warning that fires if the value of a content attribute would be sniffed as an internal character encoding declaration but an http-equiv='content-type' is missing.
>
> It's an error according to the spec.
>
> On Fri, 1 Feb 2008, Henri Sivonen wrote:
> >
> > But surely the value for content should be ASCII-case-insensitive.
>
> Ok.
>
> > Also, why limit the space to one U+0020 instead of zero or more space characters?
>
> Ok, allowed any number of space characters (and any space characters).
>
> --
> Ian Hickson               U+1047E                )\._.,--....,'``.    fL
> http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
> Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

--
Mark
Received on Friday, 29 February 2008 01:43:10 UTC