- From: Mark Davis <mark.davis@icu-project.org>
- Date: Thu, 28 Feb 2008 17:42:58 -0800
- To: "Ian Hickson" <ian@hixie.ch>
- Cc: public-i18n-core@w3.org
On Thu, Feb 28, 2008 at 5:21 PM, Ian Hickson <ian@hixie.ch> wrote:
>
> Executive summary: I made a number of changes, as described below, in
> response to the feedback on character encodings in HTML. They are covered
> by revisions 1263 to 1275 of the spec source.
>
> I have cc'ed most (though not all) of the mailing lists that were
> originally cc'ed on the messages to which I reply below, to keep everyone
> in the loop. Please, for everyone's sake, pick a single mailing list when
> replying, and trim the quotes to just the bits to which you are replying.
> Don't include the whole of this e-mail in your reply! Thanks.
>
>
> On Sun, 5 Nov 2006, Øistein E. Andersen wrote, in reply to Henri:
> > >
> > > I think conforming text/html documents should not be allowed to parse
> > > into a DOM that contains characters that are not allowed in XML 1.0.
> > > [...] I am inclined to prefer [...] U+FFFD
>
> (I've made the characters not allowed in XML also not allowed in HTML,
> with the exception of some of the space characters which we need to have
> allowed for legacy reasons.)
>
>
> > I perfectly agree. (Actually, I think that U+7F (delete) and the C1
> > control characters should be excluded [transformed into U+FFFD] as well,
> > but this could perhaps be problematic due to spurious CP1252
> > characters.)
>
> I've made them illegal but not converted them to FFFD.
>
>
> On Mon, 6 Nov 2006, Lachlan Hunt wrote:
> >
> > At the very least, ISO-8859-1 must be treated as Windows-1252. I'm not
> > sure about the other ISO-8859 encodings. Numeric and hex character
> > references from 128 to 159 must also be treated as Windows-1252 code
> > points.
>
> All already specified.
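>
> [For illustration, a minimal Python sketch of the second requirement:
> numeric and hex character references in the range 128-159 are mapped to
> the characters those values denote as Windows-1252 bytes, not to the C1
> controls. The helper name is illustrative, not from the spec:]
>
>     def remap_ncr(codepoint):
>         # 0x80-0x9F: reinterpret the reference as a Windows-1252 byte.
>         if 0x80 <= codepoint <= 0x9F:
>             try:
>                 return bytes([codepoint]).decode('windows-1252')
>             except UnicodeDecodeError:
>                 # Undefined Windows-1252 bytes (e.g. 0x81) fall through.
>                 return chr(codepoint)
>         return chr(codepoint)
>
>     assert remap_ncr(0x85) == '\u2026'   # &#133; yields an ellipsis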
>
>
> On Sun, 5 Nov 2006, Elliotte Harold wrote:
> >
> > The specific problem is that an author may publish a correctly labeled
> > UTF-8 or ISO-8859-8 document or some such. However the server sends a
> > Content-type header that requires the parser to treat the document as
> > ISO-8859-1 or US-ASCII or something else.
> >
> > The need is for server administrators to allow content authors to
> > specify content types and character sets for the documents they write.
> > The content doesn't need to change. The authors just need the ability to
> > specify the server headers for their documents.
>
> Well, we can't change the way this works from this side, so it's not
> really our problem at this point.
>
>
> On Sat, 23 Dec 2006, Henri Sivonen wrote:
> >
> > http://www.elementary-group-standards.com/web-standards/html5-http-equiv-difference.html
> >
> > In short, some authors want to use <meta http-equiv="imagetoolbar"
> > content="no"> but (X)HTML5 doesn't allow it.
> >
> > Personally, I think that authors who want to disable *User* Agent
> > features like that are misguided.
> >
> > Anyway, I thought I'd mention this so that the issue gets informed, as
> > opposed to accidental, treatment.
>
> Proprietary extensions to HTML are just that, proprietary extensions, and
> are therefore intentionally non-conforming.
>
>
> On Mon, 26 Feb 2007, Lachlan Hunt wrote:
> >
> > Given that the spec now says that ISO-8859-1 must be treated as
> > Windows-1252, should it still be considered an error to use the C1
> > control characters (U+0080 to U+009F) if ISO-8859-1 is declared?
> >
> > Some relevant messages from IRC:
> >
> > [15:59] <Lachy> since the spec says if ISO-8859-1 is declared,
> >                 Windows-1252 must be used. Is it still an error for
> >                 authors to use the C1 control characters in the range
> >                 128-159?
> > [16:23] <Hixie> Lachy: not sure what we should do, there's a bunch of
> >                 corner cases there. like, should we allow control chars
> >                 anyway, should we allow ISO-8859-1 to be declared but
> >                 Win1252 to be used, etc.
> > [16:23] <Hixie> Lachy: can you mail the list with suggestions and a list
> >                 of the cases you can think of that we should cover?
> > [16:27] <Lachy> I'm having a hard time deciding if it should be allowed
> >                 or not
> > [16:28] <Lachy> Technically, it is an error and I think users should be
> >                 notified, but it's practically harmless these days and
> >                 very common.
> > [16:30] <Lachy> Yet, doing the same thing in XML doesn't work, since XML
> >                 parsers do treat them as control characters
>
> I've made it a parse error. I'm sure implementing this is going to be
> very exciting for Henri.
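>
> [A minimal sketch of the new rule over the decoded character stream,
> assuming the check runs after byte decoding; illustrative only:]
>
>     def is_c1_parse_error(ch):
>         # U+0080-U+009F are the C1 controls; emitting one from the
>         # decoder is now a parse error (the character is kept as-is).
>         return 0x80 <= ord(ch) <= 0x9F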
>
>
> On Thu, 1 Mar 2007, Henri Sivonen wrote:
> >
> > I think that encoding information should be included in the HTTP
> > payload. In my opinion, the spec should not advise against this.
> > Preferably, it would encourage putting the encoding information in the
> > payload. (The BOM or, in the case of XML, the UTF-8 defaulting of the
> > XML sniffing algorithm are fine.)
>
> I can't seem to find the part of the spec that recommends the opposite of
> this... did I already remove it? I'm happy to make the spec silent on this
> point, since experts disagree.
>
>
> On Sun, 11 Mar 2007, Geoffrey Sneddon wrote:
> >
> > From implementing parts of the input stream (section 8.2.2 as of
> > writing) yesterday, I found several issues (some of which will show the
> > asshole[1] within me):
> >
> > - Within step one of the "get an attribute" sub-algorithm it says
> > "start over" – is this start over the sub-algorithm or the whole algorithm?
>
> Fixed.
>
>
> > - Again in step one, why do we need to skip whitespace in both the
> > sub-algorithm and at section one of the inner step for <meta> tags?
>
> Otherwise, the <meta bit would be pointing at the "<" and would treat
> "meta" as an attribute name.
>
>
> > - In step 11, when we have anything apart from a double/single quote
> > or less/greater than sign, we add it to the value, but don't move the position
> > forward, so when we move onto step 12 we add it again.
>
> Yes, valid point. Fixed.
>
>
> > - In step 3 of the very inner set of steps for a content attribute in
> > a <meta> tag, is charset case-sensitive?
>
> Doesn't matter, the parser lowercases everything anyway.
>
>
> > - Again there, shouldn't we be given unicode codepoints for that (as
> > it'll be a unicode string)?
>
> Not sure what you mean.
>
>
> On Sat, 26 May 2007, Henri Sivonen wrote:
> >
> > The draft says:
> > "A leading U+FEFF BYTE ORDER MARK (BOM) must be dropped if present."
> >
> > That's reasonable for UTF-8 when the encoding has been established by
> > other means.
> >
> > However, when the encoding is UTF-16LE or UTF-16BE (i.e. supposed to be
> > signatureless), do we really want to drop the BOM silently? Shouldn't it
> > count as a character that is in error?
>
> Do the UTF-16LE and UTF-16BE specs make a leading BOM an error?
>
> If yes, then we don't have to say anything, it's already an error.
>
> If not, what's the advantage of complaining about the BOM in this case?
>
>
> > Likewise, if an encoding signature BOM has been discarded and the first
> > logical character of the stream is another BOM, shouldn't that also
> > count as a character that is in error?
> >
> > I think I should elaborate that when the encoding is UTF-16 (not
> > UTF-16LE or UTF-16BE), the BOM gets swallowed by the character
> > decoding layer (in reasonable decoder implementations) and is not
> > returned from the character stream at all. Therefore, on the character
> > level, a droppable BOM only occurs in UTF-8 when the encoding was
> > established by other means.
>
> The spec says: "Given an encoding, the bytes in the input stream must be
> converted to Unicode characters for the tokeniser, as described by the
> rules for that encoding, except that leading U+FEFF BYTE ORDER MARK
> characters must not be stripped by the encoding layer."
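>
> [In code terms, a sketch of that division of labour, assuming a codec
> that does not itself consume the BOM (e.g. plain utf-8):]
>
>     def to_input_stream(raw, encoding):
>         chars = raw.decode(encoding)     # encoding layer: BOM kept
>         if chars.startswith('\ufeff'):
>             chars = chars[1:]            # input stream layer: drop one BOM
>         return chars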
>
>
> On Mon, 28 May 2007, Henri Sivonen wrote:
> >
> > To this end, I think at least for conforming documents the algorithm for
> > establishing the character encoding should be deterministic. I'd like to
> > request two things:
> >
> > 1) When sniffing for meta charset, the current draft allows a user agent
> > to give up sooner than after examining the first 512 bytes. To make meta
> > charset sniffing reliable and deterministic so that it doesn't depend on
> > flukes in buffering, I think UAs should (if there's no transfer protocol
> > level charset label and no BOM) be required to consume bytes until they
> > find a meta charset, reach the EOF or have examined 512 bytes. That is,
> > I think UAs should not be allowed to give up earlier. (On the other
> > hand, I think UAs should be allowed to start examining the byte stream
> > before 512 bytes have been buffered without an IO error, since in general,
> > byte stream buffer management should be up to the IO libraries and
> > outside the scope of the HTML spec.)
>
> I don't want to do this because I don't want to require that browsers
> handle a CGI script that outputs 500 bytes and then hangs for a minute in a
> way that doesn't render anything for a minute, and I don't want to require
> that people writing such CGI scripts front-load a 512 byte comment.
>
> We've already conceded that a page can document.write() an encoding
> declaration after 6 megabytes of content and end up causing a reparse.
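>
> [For reference, the rough shape of the prescan under discussion, as a
> Python sketch; the real algorithm walks bytes attribute by attribute
> rather than using a regular expression:]
>
>     import re
>
>     META_CHARSET = re.compile(
>         rb'<meta[^>]+charset\s*=\s*["\']?([a-zA-Z0-9_-]+)', re.I)
>
>     def prescan(raw):
>         # Examine at most the first 512 bytes. A UA that gives up
>         # before seeing all 512 makes the result depend on buffering.
>         m = META_CHARSET.search(raw[:512])
>         return m.group(1).decode('ascii') if m else None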
>
>
> > 2) Since the chardet step is optional and the spec doesn't make the
> > Mozilla chardet behavior normative, I think the document should be
> > considered non-conforming if the algorithm for establishing the
> > character encoding proceeds to steps 6 (chardet) or 7 (last resort
> > default).
>
> That would make most of my pages non-conforming. It would make this
> non-conforming:
>
> <!DOCTYPE HTML>
> <html>
> <head>
> <title> Example </title>
> </head>
> <body>
> <p> I don't want to be non-conforming! </p>
> </body>
> </html>
>
>
> > It wouldn't hurt, though, to say in the section on writing documents that at
> > least one of the following is required for document conformance:
> > * A transfer protocol-level character encoding declaration.
> > * A meta charset within the first 512 bytes.
> > * A BOM.
>
> We already require that, though without the 512 byte requirement.
>
>
> On Tue, 29 May 2007, Henri Sivonen wrote:
> >
> > To avoid stepping on the toes of Charmod more than is necessary, I
> > suggest making it non-conforming for a document to have bytes in the
> > 0x80…0x9F range when the character encoding is declared to be one of the
> > ISO-8859 family encodings.
>
> Done, I believe.
>
>
> > (UA conformance requires in some cases these bytes to be decoded in a
> > Charmod-violating way, but reality trumps Charmod for UA conformance.
> > While I'm at it: Surely there are other ISO-8859 family encodings
> > besides ISO-8859-1 that require decoding using the corresponding
> > windows-* family decoder?)
>
> Maybe; anyone have any concrete information?
>
>
> On Tue, 29 May 2007, Maciej Stachowiak wrote:
> >
> > I don't know of any ISO-8859 encodings requiring this, but for all
> > unicode encodings and numeric entity references compatibility requires
> > interpreting this range of code points in the WinLatin1 way.
>
> On Mon, 4 Jun 2007, Henri Sivonen wrote:
> >
> > I tested with Firefox 2.0.4, Minefield, Safari 2.0.4, WebKit nightly and
> > Opera 9.20 (all on Mac). Only Safari 2.0.4 gives the DWIM treatment to
> > the C1 code point range in UTF-8 and UTF-16.
> >
> > This makes me suspect that compatibility with the Web doesn't really
> > require the DWIM treatment here. What does IE7 do?
> >
> > The data I used: http://hsivonen.iki.fi/test/utf-c1/
>
> IE7 and Safari 3 do the same as the other browsers, namely, no DWIM
> treatment.
>
> So, I haven't changed the spec.
>
>
> On Fri, 1 Jun 2007, Henri Sivonen wrote:
> >
> > The anomalies seem to be:
> > 1) ISO-8859-1 is decoded as Windows-1252.
> > 2) 0x85 in ISO-8859-10 and in ISO-8859-16 is decoded as in Windows-1252
> > (ellipsis) by Gecko.
> > 3) ISO-8859-11 is decoded as Windows-874.
> >
> > I was rather surprised by the results. They weren't at all what I expected.
> > Test data: http://hsivonen.iki.fi/test/iso8859/
> >
> > I suggest adding the ISO-8859-11 to Windows-874 mapping to the spec.
>
> On Fri, 1 Jun 2007, Henri Sivonen wrote:
> >
> > By Firefox and Opera. Safari doesn't support ISO-8859-11 and I was
> > unable to test IE.
>
> On Fri, 1 Jun 2007, Simon Pieters wrote:
> >
> > IE7 and Opera handle ISO-8859-11.htm the same, AFAICT.
>
> I did some studies and there appear to be enough pages labelled as
> ISO-8859-11 to add this. I didn't check how many had bytes in the
> affected range, which would perhaps be worth checking, though.
>
>
> On Sat, 2 Jun 2007, Øistein E. Andersen wrote:
> >
> > As suggested earlier [1], a simpler solution seems to be to treat C1
> > bytes and NCRs from /all/ ISO-8859-* and Unicode encodings as
> > Windows-1252.
>
> That seems excessive.
>
>
> On Tue, 5 Jun 2007, Henri Sivonen wrote:
> > >
> > > To avoid stepping on the toes of Charmod more than is necessary, I
> > > suggest making it non-conforming for a document to have bytes in the
> > > 0x80…0x9F range when the character encoding is declared to be one of
> > > the ISO-8859 family encodings.
> >
> > I've been thinking about this. I have a proposal on how to spec this
> > *conceptually* and how to implement this with error reporting. I am
> > assuming here that 1) No one ever intends C1 code points to be present
> > in the decoded stream and 2) we want, as a Charmod correctness fig leaf,
> > to make the C1 bytes non-conforming when ISO-8859-1 or ISO-8859-11 was
> > declared but Windows-1252 or Windows-874 decoding is needed.
>
> I really don't care too much about the fig leaf part.
>
>
> > Based on the behavior of Minefield and Opera 9.20, the following seems
> > to be the least Charmod violating and least quirky approach that could
> > possibly work:
> >
> > 1) Decode the byte stream using a decoder for whatever encoding was declared,
> > even ISO-8859-1 or ISO-8859-11, according to the mappings at
> > ftp://ftp.unicode.org/Public/MAPPINGS/.
> > 2) If a character in the decoded character stream is in the C1 code point
> > range, this is a document conformance violation.
> > 2a) If the declared encoding was ISO-8859-1, replace that character with
> > the character that you get by casting the code point into a byte and decoding
> > it as Windows-1252.
> > 2b) If the declared encoding was ISO-8859-11, replace that character with
> > the character that you get by casting the code point into a byte and decoding
> > it as Windows-874.
>
> That sounds far more complex than what we have now.
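>
> [For concreteness, Henri's proposed steps as a sketch; cp1252/cp874 are
> Python's codecs for Windows-1252/Windows-874, and the function name is
> made up:]
>
>     C1_REMAP = {'iso-8859-1': 'cp1252',    # windows-1252
>                 'iso-8859-11': 'cp874'}    # windows-874
>
>     def fix_up_c1(chars, declared):
>         out = []
>         for ch in chars:
>             if 0x80 <= ord(ch) <= 0x9F and declared in C1_REMAP:
>                 # Conformance violation; cast the code point back to a
>                 # byte and decode it with the windows-* sibling.
>                 # ('replace' covers bytes undefined even there.)
>                 ch = bytes([ord(ch)]).decode(C1_REMAP[declared], 'replace')
>             out.append(ch)
>         return ''.join(out)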
>
>
> On Tue, 5 Jun 2007, Kristof Zelechovski wrote:
> >
> > 2c) If the declared encoding was ISO-8859-2, replace that character
> > with the character that you get by casting the code point into a byte
> > and decoding it as Windows-1250.
>
> On Tue, 5 Jun 2007, Henri Sivonen wrote:
> >
> > As far as I can tell, that's not what Firefox, Minefield, Opera 9.20 and
> > WebKit nightlies do, so apparently it is not required for compatibility
> > with a notable number of pages.
>
> Indeed.
>
>
> On Tue, 5 Jun 2007, Maciej Stachowiak wrote:
> >
> > What we actually do in WebKit is always use a windows-1252 decoder when
> > ISO-8859-1 is requested. I don't think it's very helpful to make all
> > documents that declare a ISO-8859-1 encoding and use characters in the
> > C1 range nonconforming. It's true that they are counting on nonstandard
> > processing of the nominally declared encoding, but I don't think that
> > causes a problem in practice, as long as the rule is well known. It
> > seems simpler to just make latin1 an alias for winlatin1.
>
> I agree.
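>
> [The WebKit approach amounts to a label alias table consulted before a
> decoder is ever constructed, roughly:]
>
>     LABEL_ALIASES = {'iso-8859-1': 'windows-1252',
>                      'latin1': 'windows-1252',
>                      'iso-8859-11': 'windows-874'}
>
>     def resolve_label(label):
>         label = label.strip().lower()
>         return LABEL_ALIASES.get(label, label)
>
> [No per-character fix-up pass is needed this way, which is what makes it
> simpler than the decode-then-patch proposal above.]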
>
>
> On Fri, 1 Jun 2007, Raphael Champeimont (Almacha) wrote:
> >
> > I think there is something wrong in the "get an attribute" algorithm
> > from 8.2.2 (The input stream).
> >
> > Between steps 11 and 12 I think there is a missing:
> >
> > 11b: Advance position to the next byte.
> >
> > With the current algorithm, if I write <meta charset = ascii> it will
> > say the value of the charset attribute is "aascii", with one leading
> > "a" too many.
> >
> > The reason is that in step 11 if we fall in case "Anything else" we add
> > the new char to the string, and then if we fall in "Anything else" in
> > step 12 we add again the *same* char to the string, so the first char of
> > the attribute value appears 2 times.
>
> Fixed. (Though please check. I made several changes to this algorithm and
> would be happier if I knew someone had proofread the changes!)
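>
> [The fix, as a sketch of the corrected value-collection steps; data is
> the byte stream and pos the position pointer, names illustrative:]
>
>     def collect_value(data, pos):
>         value = [chr(data[pos]).lower()]   # step 11: first character
>         pos += 1                           # step 11b: advance (the fix)
>         # step 12: collect until whitespace or '>'
>         while pos < len(data) and data[pos] not in b' \t\n\r\x0c>':
>             value.append(chr(data[pos]).lower())
>             pos += 1
>         return ''.join(value), pos
>
> [Without the pos += 1 between the two steps, the first character is
> consumed twice, giving "aascii" for <meta charset = ascii>.]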
>
>
> On Fri, 1 Jun 2007, Henri Sivonen wrote:
> >
> > In the charset meta sniffing algorithm under "Attribute name:":
> >
> > > If it is 0x2F (ASCII '/'), 0x3C (ASCII '<'), or 0x3E (ASCII '>')
> > > Stop looking for an attribute. The attribute's name is the value of
> > > attribute name, its value is the empty string.
> >
> > In general, it seems to me the algorithm isn't quite clear on when to
> > stop looking for the current attribute and when to stop looking for
> > attributes for the current tag altogether.
>
> The spec never distinguishes these two cases in the "get an attribute"
> algorithm -- the algorithm that invokes the "get an attribute" algorithm
> is the one that decides how often it is done.
>
>
> > In this step, it seems to me that '/' should advance the pointer and end
> > getting the current attribute followed by getting another attribute. '>'
> > should end getting attributes on the whole tag without changing the
> > pointer.
>
> It doesn't matter. Both return an attribute, then the invoking algorithm
> retries and if that results in no attribute (because you're on the ">")
> then you stop looking for the tag.
>
>
> On Fri, 1 Jun 2007, Henri Sivonen wrote:
> >
> > The spec probably needs to be made more specific about the case where
> > the ASCII byte-based algorithm finds a supported encoding name but the
> > encoding is not a rough ASCII superset.
> >
> > 23:46 < othermaciej> one quirk in Safari is that if there's a meta tag
> >                      claiming the source is utf-16, we treat it as utf-8
> > ...
> > 23:48 < othermaciej> hsivonen: there is content that needs it
> > ...
> > 23:52 < othermaciej> hsivonen: I think we may treat any claimed unicode
> >                      charset in a <meta> tag as utf-8
>
> Oops, I had this for the case where utf-16 was detected on the fly, but
> not for the preparser. Fixed.
>
>
> On Sat, 2 Jun 2007, Philip Taylor wrote:
> >
> > 8.2.2. The input stream: "If the next six characters are not 'charset'"
> > - s/six/seven/
>
> Fixed.
>
>
> On Thu, 14 Jun 2007, Henri Sivonen wrote:
> >
> > As written, the charset sniffing algorithm doesn't trim space characters
> > from around the tentative encoding name. The html5lib test cases expect
> > the space characters to be trimmed.
> >
> > I suggest trimming space characters (or anything <= 0x20 depending on
> > which approach is the right for compat).
>
> Actually it seems browsers don't do any trimming here. I've added a
> comment to that effect.
>
>
> On Sat, 23 Jun 2007, Øistein E. Andersen wrote:
> > >>
> > >>> Bytes or sequences of bytes in the original byte stream that could
> > >>> not be converted to Unicode characters must be converted to U+FFFD
> > >>> REPLACEMENT CHARACTER code points.
> > >>
> > >> [This does not specify the exact number of replacement characters.]
> > >
> > > I don't really know how to define this.
> >
> > Unicode 5.0 remains vague on this point. (E.g., definition D92 defines
> > well-formed and ill-formed UTF-8 byte sequences, but conformance
> > requirement C10 only requires ill-formed sequences to be treated as an
> > error condition and suggests that a one-byte ill-formed sequence may be
> > either filtered out or replaced by a U+FFFD replacement character.) More
> > generally, character encoding specifications can hardly be expected to
> > define proper error handling, since they are usually not terribly
> > preoccupied with mislabelled data.
>
> They should define error handling, and are defective if they don't.
> However, I agree that many specs are defective. This is certainly not
> limited to character encoding specifications.
>
>
> > The current text may nevertheless be too liberal. It would notably be
> > possible to construct an arbitrarily long Chinese text in a legacy
> > encoding which -- according to the spec -- could be replaced by one
> > single U+FFFD replacement character if incorrectly handled as UTF-8,
> > which might lead the user to think that the page is completely
> > uninteresting and therefore move on, whereas a larger number of
> > replacement characters would have led him to try another encoding. (This
> > is only a problem, of course, if an implementor chooses to emit the
> > minimal number of replacement characters sanctioned by the spec.)
>
> Yes, but this is a user interface issue, not an interoperability issue, so
> I don't think we need to be concerned about it.
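>
> [To make the concern concrete, a Python sketch. Python's decoder happens
> to take the one-replacement-per-ill-formed-sequence option, so the
> mojibake stays visibly long:]
>
>     chinese = '\u6c49\u5b57' * 200                 # "Chinese characters"
>     raw = chinese.encode('gb2312')                 # a legacy encoding
>     out = raw.decode('utf-8', errors='replace')
>     # Many U+FFFDs survive; a decoder that collapsed the whole run into
>     # a single U+FFFD would, as written, also satisfy the spec.
>     print(len(out), out.count('\ufffd'))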
>
>
> On Thu, 2 Aug 2007, Henri Sivonen wrote:
>
> > On Aug 2, 2007, at 10:11, Ian Hickson wrote:
> >
> > > Would a non-normative note help here? Something like:
> > >
> > > Note: Bytes or sequences of bytes in the original byte stream that did
> > > not conform to the encoding specification (e.g. invalid UTF-8 byte
> > > sequences in a UTF-8 input stream) are errors that conformance
> > > checkers are expected to report.
> > >
> > > ...to be put after the paragraph that reads "Bytes or sequences of
> > > bytes in the original byte stream that could not be converted to
> > > Unicode characters must be converted to U+FFFD REPLACEMENT CHARACTER
> > > code points".
> >
> > Yes, this is what I meant by "a note hinting at the consequences".
>
> Ok, added.
>
>
> > > (Note that not all bytes or sequences of bytes in the original byte
> > > stream that could not be converted to Unicode characters are
> > > necessarily errors. It could just be that the encoding has a character
> > > set that isn't a subset of Unicode, e.g. the Apple logo found in most
> > > Apple character sets doesn't have a non-PUA analogue in Unicode. Its
> > > presence in an HTML document isn't an error as far as I'm concerned.)
> >
> > Since XML and HTML5 are defined in terms of Unicode characters, there's
> > nowhere to go except error and REPLACEMENT CHARACTER or the PUA for
> > characters that aren't in Unicode. I'd steer clear of this in the spec
> > and let decoders choose between de facto PUA assignments (like U+F8FF for
> > the Apple logo) and errors.
>
> Yeah, I don't have any intention of mentioning this in the spec.
>
>
> On Wed, 31 Oct 2007, Martin Duerst wrote:
> >
> > [8.2.2.1]
> >
> > In point 3., it's not completely clear whether the encoding returned is
> > e.g. "UTF-16BE BOM" or "UTF-16BE". Probably the best thing editorially
> > is to move the word BOM from the description column of the table to the
> > text prior to the table.
>
> Fixed.
>
>
> > In point 7, what I find unnecessary is the repeated mention of heuristic
> > algorithms, which are already mentioned previously in point 6.
>
> The heuristics in step 6 are for determining an encoding based on the byte
> stream, e.g. using frequency analysis. The heuristics in step 7 are for
> picking a default once that has failed. For example, if the defaults are
> UTF-8 or Win1252, then you can determine which to pick by simply deciding
> whether or not the stream is valid UTF-8.
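>
> [That step-7 heuristic, as a sketch:]
>
>     def pick_default(raw):
>         try:
>             raw.decode('utf-8')     # strict: any ill-formed byte throws
>             return 'utf-8'
>         except UnicodeDecodeError:
>             return 'windows-1252'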
>
>
> > (I'm really interested what document [UNIVCHADET] is going to point to.)
>
> http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
>
> (It's in the source.)
>
>
> > What I find missing/unclear is that the user can overwrite the page
> > encoding manually. What is mentioned is a user-specified default, which
> > makes sense (e.g. "well, I'm mostly viewing Chinese pages, so I set my
> > default to GB2312"). However, what we also need is the possibility for a
> > user to override the encoding of a specific page (not changing the
> > default). This is necessary because some pages are still mislabeled.
> > When such an override is present, it should come before what's currently
> > number 1.
>
> User agents can provide user interfaces to override anything they want,
> e.g. they could provide an interface that changes all <script> elements
> into <pre> elements on the fly, or whatever. Such behaviour is outside the
> scope of the specification, since it is no longer about interoperability,
> but about user control. It's technically non-compliant, because it is
> doing something with the page that doesn't match what would happen for
> other people (unless they _also_ overrode the spec behaviour).
>
>
> > In 8.2.2.2, what I find unnecessary is that encodings such as UTF-7 are
> > explicitly forbidden. I agree that these are virtually useless. However,
> > I don't think implementing them would create any harm, and I don't think
> > they should be dignified by even mentioning them.
>
> Sadly they do cause harm. The ones that are outlawed have all been used in
> either actual attacks or proof-of-concept attacks described in
> vulnerability reports, mostly due to their deceptive similarity to more
> common encodings. (UTF-7 in particular has been used in a number of
> attacks, because IE supported auto-detecting it, if I recall correctly.)
>
>
> > In 8.2.2.4, I have no idea what's the reason or purpose of point 1,
> > which reads "If the new encoding is UTF-16, change it to UTF-8.". I
> > suspect some misunderstanding.
>
> This is required because many pages are labelled as UTF-16 but actually
> use UTF-8. For example:
>
> http://www.zingermans.com
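>
> [As a sketch of that step; whether the LE/BE labels are caught as well
> is an assumption here, not something the quoted text settles:]
>
>     def change_the_encoding(new_label):
>         # Pages labelled UTF-16 in <meta> but actually serialized as
>         # UTF-8 are common enough that the label is rewritten first.
>         if new_label.lower() in ('utf-16', 'utf-16le', 'utf-16be'):
>             return 'utf-8'
>         return new_label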
>
>
> > Well, now let's get back to CharMod, and to the place where I think you
> > need to do more work. HTML5 currently says "treat data labeled
> > iso-8859-1 as windows-1252". This conflicts with C025 of CharMod
> > (http://www.w3.org/TR/charmod/#C025):
> >
> > C025 [I] [C] An IANA-registered charset name MUST NOT be used to label
> > text data in a character encoding other than the one identified in the
> > IANA registration of that name.
> >
> > and also C030 (http://www.w3.org/TR/charmod/#C030): C030 [I] When an
> > IANA-registered charset name is recognized, receiving software MUST
> > interpret the received data according to the encoding associated with
> > the name in the IANA registry.
> >
> > So the following sentence:
> >
> > "When a user agent would otherwise use the ISO-8859-1 encoding, it must
> > instead use the Windows-1252 encoding."
> >
> > from HTML5 is clearly not conforming to CharMod.
>
> Indeed, it says so explicitly in the spec.
>
>
> > Please note that the above items (C025 and C030) say that they only
> > affect implementations ([I]) and content ([C]), but I think the main
> > reason for this is that we never even imagined that a spec would say
> > "you must treat FOO as BAR".
> >
> > I don't disagree with 'widely deployed', but I think one main reason for
> > this is that it took ages to get windows-1252 registered. I think there
> > are other ways to deal with this issue than a MUST. One thing that I
> > guess you could do is to just describe current practice.
>
> Well, what we're describing is what an implementation has to do to be
> compatible with the other implementations. And right now, this is one of
> the things it has to do.
>
>
> > This brings me to another point: The whole HTML5 spec seems to be
> > written with implementers, and implementers only, in mind. This is great
> > to help get browser behavior aligned, but it creates an enormous
> > problem: The majority of potential users of the spec, namely creators of
> > content, and of tools creating content, are completely left out. As an
> > example, trying to reverse-engineer how to indicate the character
> > encoding inside an HTML5 document from point 4 in 8.2.2.1 is completely
> > impossible for content creators, webmasters, and the like.
>
> Section "8.2 Parsing HTML documents" is indeed exclusively for user agent
> implementors and conformance checker implementors. For authors and
> authoring tool implementors, you want section "8.1 Writing HTML documents"
> and section "3.7.5.4. Specifying the document's character encoding" (which
> is linked to from 8.1). These give the flipside of these requirements, the
> authoring side.
>
>
> On Sat, 3 Nov 2007, Addison Phillips wrote:
> >
> > --
> > Otherwise, return an implementation-defined or user-specified default
> > character encoding, with the confidence tentative. Due to its use in
> > legacy content, windows-1252 is recommended as a default in
> > predominantly Western demographics. In non-legacy environments, the more
> > comprehensive UTF-8 encoding is recommended instead. Since these
> > encodings can in many cases be distinguished by inspection, a user agent
> > may heuristically decide which to use as a default.
> > --
> >
> > Our comment is that this is a pretty weak recommendation. It is
> > difficult to say what a "Western demographic" means in this context. We
> > think we know why this is here: untagged HTML4 documents have a default
> > character encoding of ISO 8859-1, so it is unsurprising to assume its
> > common superset encoding when no other encoding can be guessed.
> >
> > However, we would like to see several things happen here:
> >
> > 1. It never actually says anywhere why windows-1252 must be used instead
> > of ISO 8859-1.
>
> This is required in "Preprocessing the input stream".
>
>
> > 2. As quoted, it seems to (but does not actually) favor 1252 over UTF-8.
> > Since UTF-8 is highly detectable and also the best long-term general
> > default, we'd prefer if the emphasis were reversed, dropping the
> > reference to "Western demographics". For example:
> >
> > --
> > Otherwise, return an implementation-defined or user-specified default
> > character encoding, with the confidence tentative. UTF-8 is recommended
> > as a default encoding in most cases. Due to its use in legacy content,
> > windows-1252 is also recommended as a default. Since these encodings can
> > usually be distinguished by inspection, a user agent may heuristically
> > decide which to use as a default.
> > --
>
> I've reversed the order, though not removed the mention of the Western
> demographic, which I think is actually quite accurate and generally more
> understandable than, say, occidental. I would like to know what the more
> common codecs are in oriental demographics, though, to broaden the use of
> the recommendations.
>
>
> > 3. Possibly something should be said (elsewhere, not in this paragraph)
> > about using other "superset" encodings in preference to the explicitly
> > named encoding (that is, other encodings bear the same relationship as
> > windows-1252 does to iso8859-1 and user-agents actually use these
> > encodings to interpret pages and/or encode data in forms, etc.)
>
> Is the current (new) text sufficient in this regard? See also the earlier
> comments for details on the decisions behind the new text.
>
>
> On Thu, 6 Dec 2007, Sam Ruby wrote:
> > Ian Hickson wrote:
> > > On Wed, 5 Dec 2007, Sam Ruby wrote:
> > > > Henri Sivonen wrote:
> > > > > I identified four classes of errors:
> > > > > 1) meta charset in XHTML
> > > > Why specifying a charset that matches the encoding is flagged as an
> > > > error is probably something that should be discussed another day.
> > > > I happen to believe that people will author content intended to be
> > > > used by multiple user agents which are at various levels of spec
> > > > conformance.
> > >
> > > That's actually an XML issue -- XML says the encoding should be in the
> > > XML declaration, so HTML tries to not step on its toes and says that
> > > the charset declaration shouldn't be included in the markup. (The spec
> > > has to say that the UA must ignore that line anyway, so it's not clear
> > > that there's any benefit to including it.)
> >
> > If the declarations clashed, I could see the value in an error message,
> > but as I said, this can be discussed another day.
>
> Is it another day yet? :-)
>
>
> On Fri, 25 Jan 2008, Frank Ellermann wrote:
> >
> > Hi, the chapter about "acceptable" charsets (8.2.2.2) is messy. Clearly
> > UTF-8 and windows-1252 are popular, and you have that.
> >
> > What you need as a "minimum" for new browsers is UTF-8, US-ASCII (as
> > popular proper subset of UTF-8), ISO-8859-1 (as HTML legacy), and
> > windows-1252 for the reasons stated in the draft, supporting Latin-1 but
> > not windows-1252 would be stupid.
>
> Right, that's what the draft currently requires.
>
>
> > BTW, I'm not aware that windows-1252 is a violation of CHARMOD, I asked
> > a question about it and C049 in a Last Call of CHARMOD.
>
> See one of the earlier e-mails in this compound reply for the reasoning.
>
>
> > Please s/but may support more/but should support more/ - the minimum is
> > only that, the minimum.
>
> "SHOULD" has very strong connotations that I do not think apply here. In
> particular, it makes no sense to have an open-ended SHOULD in this
> context.
>
>
> > | User agents must not support the CESU-8, UTF-7, BOCU-1 and SCSU
> > | encodings
> >
> > I can see a MUST NOT for UTF-7 and CESU-8. And IMO the only good excuse
> > for legacy charsets is backwards compatibility. But that is at worst a
> > "SHOULD NOT" for BOCU-1, as you have it for UTF-32.
> >
> > I refuse to discuss SCSU, but MUST NOT is rather harsh, isn't it ?
>
> As noted earlier, these requirements are derived from real or potential
> security vulnerabilities.
>
>
> > In 3.7.5.4 you say:
> >
> > | Authors should not use JIS_X0212-1990, x-JIS0208, and encodings
> > | based on EBCDIC. Authors should not use UTF-32.
> >
> > What's the logic behind these recommendations ? Of course EBCDIC
> > is rare (as far as HTML is concerned I've never seen it), but it's
> > AFAIK not worse than codepage 437, 850, 858, or similar charsets.
>
> Those are non-US-ASCII-compatible encodings. For further reasoning see the
> thread that resulted in:
>
> http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2007-June/011949.html
>
>
> > And UTF-32 is relatively harmless, not much worse than UTF-16, it
> > belongs to the charsets recommended in CHARMOD. Depending on what
> > happens in future Unicode versions banning UTF-32 could backfire.
>
> Actually UTF-32 is quite harmful, due to its extra cost in implementation,
> its very limited testing, and the resulting bugs in almost all known
> implementations.
>
>
> > There are lots of other charsets starting with UTF-1 that could be
> > listed as SHOULD NOT or even MUST NOT. Whatever you pick, state what
> > your reasons are, not only the (apparently) arbitrary result.
>
> The reasons are sometimes rather involved or subtle, and I'd rather not
> have the specification defend itself. It's a spec, not a position paper. :-)
>
>
> > Please make sure that all *unregistered* charsets are SHOULD NOT. Yes, I
> > know the consequences for some proprietary charsets, they are free to
> > register them or to be ignored (CHARMOD C022).
>
> It's already a must ("The value must be a valid character encoding name,
> and must be the preferred name for that encoding.").
>
>
> On Tue, 29 Jan 2008, Brian Smith wrote:
> > Henri Sivonen wrote:
> > > My understanding is that HTML 5 bans these post-UTF-8
> > > second-system Unicode encodings no matter where you might
> > > declare the use.
> >
> > It is in section 3.7.5 (the META element), and not in section 8 (The
> > HTML Syntax), and the reference to section 3.7.5 in section 8 says that
> > the restrictions apply (only) in a (<META>) character encoding
> > declaration. So, it seems the real issue is just clarifying the text in
> > 3.7.5.4 to indicate that those restrictions apply only when the META
> > charset override mechanism is being used.
>
> I don't understand.
>
>
> > > The purpose of the HTML 5 spec is to improve interoperability between
> > > Web browsers as used with content and Web apps published on the one
> > > public Web. The normative language in the spec is concerned with
> > > publishing and consuming content and apps on the Web. The purpose of
> > > the spec isn't to lower the R&D cost of private and proprietary
> > > systems by producing reusable bits.
> >
> > Then why doesn't the specification list the encodings that conformant
> > web browsers are required to support, instead of listing the encodings
> > that document authors are forbidden from using?
>
> Because the former list is open-ended, whereas the latter is not, and
> the latter list is more important.
>
>
> > > > Even after Unicode and the UTF encodings, new encodings are still
> > > > being created.
> > >
> > > Deploying such encodings on the public network is a colossally bad
> > > idea. (My own nation has engaged in this folly with ISO-8859-15, so
> > > I've seen the bad consequences at home, too.)
> >
> > That is exactly my point. If the intention is that BOCU-1 should be
> > prohibited, then shouldn't ISO-8859-15 be prohibited for the same
> > reason? Why one and not the other?
>
> One is used. The other is not. It really is that simple. We can stop the
> madness for one of them, but it's too late for the other.
>
>
> > Anyway, I am pretty sure that the restriction against BOCU and similar
> > encodings is just to make it possible to correctly parse the <META>
> > charset override, not to prevent their use altogether. The language just
> > needs to be made clearer.
>
> As the spec says, "authors must not use the CESU-8, UTF-7, BOCU-1 and SCSU
> encodings". There's no limitation to <meta> or anything. They are just
> banned outright.
>
>
> On Thu, 31 Jan 2008, Henri Sivonen wrote:
> >
> > I ran an analysis on recent error messages from Validator.nu.
> > http://hsivonen.iki.fi/test/moz/analysis.txt
>
> Looking at this from the point of view of encodings, I see the following
> common errors:
>
> * <meta charset> not being at the top of <head>
> * missing explicit character encoding declaration
> * <meta content=""> not starting with text/html
> * unpreferred encoding names
>
> I think all of these are real errors, and I don't think we should change
> the spec's encoding rules based on this data.
>
> Thanks for this data. Basing spec development on real data like this is of
> huge value.
>
>
> On Thu, 31 Jan 2008, Sam Ruby wrote:
> > >
> > > I think we should allow the old internal encoding declaration syntax
> > > for text/html as an alternative to the more elegant syntax. Not
> > > declaring the encoding is bad, so we shouldn't send a negative message
> > > to the authors who are declaring the encoding. Moreover, this is
> > > interoperable stuff.
> > >
> > > I think we shouldn't allow this for application/xhtml+xml, though,
> > > because authors might think it has an effect.
> >
> > By that reasoning, a meta charset encoding declaration should not be
> > allowed if a charset is specified on the Content-Type HTTP header. I
> > ran into that very problem today:
> >
> > http://lists.planetplanet.org/archives/devel/2008-January/001747.html
> >
> > This content was XHTML, but was served as text/html, with a charset
> > specified on the HTTP header, which overrode the charset on the meta
> > declaration.
>
> If they don't match, then there's an error (forcibly so, since one of the
> two encodings has to be wrong!).
>
>
> > Serving XHTML as text/html, with BOTH a charset specified on the HTTP
> > header AND a meta charset specified just in case is more common than you
> > might think.
>
> It's not a recommended behaviour, though. Just pick one and use it. The
> practice of making documents schizophrenic like this is a side-effect of
> the market not fully supporting XHTML (i.e. IE). If it wasn't for that,
> people wouldn't be as determined to give their documents identity crises.
>
>
> > A much more useful restriction -- spanning both the HTML5 and XHTML5
> > serializations -- would be to issue an error if multiple sources for
> > encoding information were explicitly specified and if they differ.
>
> That's already required.
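>
> [The requirement, reduced to a sketch of what a checker would test;
> label normalization is simplified here to strip-and-lowercase:]
>
>     def declarations_consistent(http_charset, meta_charset):
>         norm = lambda s: s.strip().lower() if s else None
>         a, b = norm(http_charset), norm(meta_charset)
>         # An error only when both are given and they disagree.
>         return a is None or b is None or a == b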
>
>
> On Mon, 11 Feb 2008, Henri Sivonen wrote:
> > >
> > > A much more useful restriction -- spanning both the HTML5 and XHTML5
> > > serializations -- would be to issue an error if multiple sources for
> > > encoding information were explicitly specified and if they differ.
> >
> > I agree. I had already implemented this as a warning on the XML side.
> > (Not as an error because I'm not aware of any spec that I could justify
> > for calling it an error.)
>
> If the declarations disagree, one of them is wrong. It's an error for the
> declaration to be wrong.
>
>
> > While I was at it, I noticed that the spec (as well as Gecko) doesn't
> > require http-equiv='content-type' when looking for a content attribute
> > that looks like an internal encoding declaration. Therefore, I also
> > added a warning that fires if the value of a content attribute would be
> > sniffed as an internal character encoding declaration but an
> > http-equiv='content-type' is missing.
>
> It's an error according to the spec.
>
>
> On Fri, 1 Feb 2008, Henri Sivonen wrote:
> >
> > But surely the value for content should be ASCII-case-insensitive.
>
> Ok.
>
>
> > Also, why limit the space to one U+0020 instead of zero or more space
> > characters?
>
> Ok, allowed any number of space characters (and any space characters).
>
> --
> Ian Hickson U+1047E )\._.,--....,'``. fL
> http://ln.hixie.ch/ U+263A /, _.. \ _\ ;`._ ,.
> Things that are impossible just take longer. `._.-(,_..'--(,_..'`-.;.'
--
Mark
Received on Friday, 29 February 2008 01:43:10 UTC