Re: several messages about handling encodings in HTML from Mark Davis on 2008-02-29 (public-i18n-core@w3.org from January to March 2008)

From: Mark Davis <mark.davis@icu-project.org>
Date: Thu, 28 Feb 2008 17:42:58 -0800
To: "Ian Hickson" <ian@hixie.ch>
Cc: public-i18n-core@w3.org
Message-ID: <30b660a20802281742v4c920110kac53a1ab0dc6d75a@mail.gmail.com>
On Thu, Feb 28, 2008 at 5:21 PM, Ian Hickson <ian@hixie.ch> wrote:
>
>  Executive summary: I made a number of changes, as described below, in
>  response to the feedback on character encodings in HTML. They are covered
>  by revisions 1263 to 1275 of the spec source.
>
>  I have cc'ed most (though not all) of the mailing lists that were
>  originally cc'ed on the messages to which I reply below, to keep everyone
>  in the loop. Please, for everyone's sake, pick a single mailing list when
>  replying, and trim the quotes to just the bits to which you are replying.
>  Don't include the whole of this e-mail in your reply! Thanks.
>
>
>  On Sun, 5 Nov 2006, Øistein E. Andersen wrote, in reply to Henri:
>  > >
>  > > I think conforming text/html documents should not be allowed to parse
>  > > into a DOM that contains characters that are not allowed in XML 1.0.
>  > > [...] I am inclined to prefer [...] U+FFFD
>
>  (I've made the characters not allowed in XML also not allowed in HTML,
>  with the exception of some of the space characters which we need to have
>  allowed for legacy reasons.)
>
>
>  > I perfectly agree. (Actually, I think that U+7F (delete) and the C1
>  > control characters should be excluded [transformed into U+FFFD] as well,
>  > but this could perhaps be problematic due to spurious CP1252
>  > characters.)
>
>  I've made them illegal but not converted them to FFFD.
>
>
>  On Mon, 6 Nov 2006, Lachlan Hunt wrote:
>  >
>  > At the very least, ISO-8859-1 must be treated as Windows-1252.  I'm not
>  > sure about the other ISO-8859 encodings.  Numeric and hex character
>  > references from 128 to 159 must also be treated as Windows-1252 code
>  > points.
>
>  All already specified.
>
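As an illustration only (not taken from the thread or the spec text), the
two rules mentioned above can be sketched in Python. The function names are
hypothetical, and the handling of the five bytes that Python's cp1252 codec
leaves undefined is an assumption made for this sketch.

```python
def decode_latin1_label(data: bytes) -> str:
    """Decode bytes labelled ISO-8859-1 using the Windows-1252 table."""
    # errors="replace" is an assumption of this sketch: Python's cp1252 codec
    # has no mapping for 0x81, 0x8D, 0x8F, 0x90 and 0x9D.
    return data.decode("cp1252", errors="replace")


def resolve_numeric_reference(code_point: int) -> str:
    """Map a numeric character reference, treating 128..159 as Windows-1252."""
    if 0x80 <= code_point <= 0x9F:
        try:
            return bytes([code_point]).decode("cp1252")
        except UnicodeDecodeError:
            # Assumption: keep the C1 code point where cp1252 has no mapping.
            return chr(code_point)
    return chr(code_point)
```

With a strict ISO-8859-1 decoder, the byte 0x93 would instead become the C1
control character U+0093, which is the difference the thread is about.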
>
>  On Sun, 5 Nov 2006, Elliotte Harold wrote:
>  >
>  > The specific problem is that an author may publish a correctly labeled
>  > UTF-8 or ISO-8859-8 document or some such. However the server sends a
>  > Content-type header that requires the parser to treat the document as
>  > ISO-8859-1 or US-ASCII or something else.
>  >
>  > The need is for server administrators to allow content authors to
>  > specify content types and character sets for the documents they write.
>  > The content doesn't need to change. The authors just need the ability to
>  > specify the server headers for their documents.
>
>  Well, we can't change the way this works from this side, so it's not
>  really our problem at this point.
>
>
>  On Sat, 23 Dec 2006, Henri Sivonen wrote:
>  >
>  > http://www.elementary-group-standards.com/web-standards/html5-http-equiv-difference.html
>  >
>  > In short, some authors want to use <meta http-equiv="imagetoolbar"
>  > content="no"> but (X)HTML5 doesn't allow it.
>  >
>  > Personally, I think that authors who want to disable *User* Agent
>  > features like that are misguided.
>  >
>  > Anyway, I thought I'd mention this so that the issue gets informed as
>  > opposed to accidental treatment.
>
>  Proprietary extensions to HTML are just that, proprietary extensions, and
>  are therefore intentionally not conforming.
>
>
>  On Mon, 26 Feb 2007, Lachlan Hunt wrote:
>  >
>  > Given that the spec now says that ISO-8859-1 must be treated as
>  > Windows-1252, should it still be considered an error to use the C1
>  > control characters (U+0080 to U+009F) if ISO-8859-1 is declared?
>  >
>  > Some relevant messages from IRC:
>  >
>  > [15:59]       <Lachy> since the spec says if ISO-8859-1 is declared, Windows-1252
>  > must be used. Is it still an error for authors to use the C1 control
>  > characters in the range 128-159?
>  > [16:23]       <Hixie> Lachy: not sure what we should do, there's a bunch of corner
>  > cases there. like, should we allow control chars anyway, should we allow
>  > ISO-8859-1 to be declared but Win1252 to be used, etc.
>  > [16:23]       <Hixie> Lachy: can you mail the list with suggestions and a list of
>  > the cases you can think of that we should cover?
>  > [16:27]       <Lachy> I'm having a hard time deciding if it should be allowed or not
>  > [16:28]       <Lachy> Technically, it is an error and I think users should be
>  > notified, but it's practically harmless these days and very common.
>  > [16:30]       <Lachy> Yet, doing the same thing in XML doesn't work, since XML
>  > parsers do treat them as control characters
>
>  I've made it be a parse error. I'm sure implementing this is going to be
>  very exciting for Henri.
>
>
>  On Thu, 1 Mar 2007, Henri Sivonen wrote:
>  >
>  > I think that encoding information should be included in the HTTP
>  > payload. In my opinion, the spec should not advise against this.
>  > Preferably, it would encourage putting the encoding information in the
>  > payload. (The BOM or, in the case of XML, the UTF-8 defaulting of the
>  > XML sniffing algorithm are fine.)
>
>  I can't seem to find the part of the spec that recommends the opposite of
>  this... did I already remove it? I'm happy to make the spec silent on this
>  point, since experts disagree.
>
>
>  On Sun, 11 Mar 2007, Geoffrey Sneddon wrote:
>  >
>  > From implementing parts of the input stream (section 8.2.2 as of
>  > writing) yesterday, I found several issues (some of which will show the
>  > asshole[1] within me):
>  >
>  >       - Within the step one of the get an attribute sub-algorithm it says
>  > "start over" – is this start over the sub-algorithm or the whole algorithm?
>
>  Fixed.
>
>
>  >       - Again in step one, why do we need to skip whitespace in both the
>  > sub-algorithm and at section one of the inner step for <meta> tags?
>
>  Otherwise, the <meta bit would be pointing at the "<" and would treat
>  "meta" as an attribute name.
>
>
>  >       - In step 11, when we have anything apart from a double/single quote
>  > or less/greater than sign, we add it to the value, but don't move the position
>  > forward, so when we move onto step 12 we add it again.
>
>  Yes, valid point. Fixed.
>
>
>  >       - In step 3 of the very inner set of steps for a content attribute in
>  > a <meta> tag, is charset case-sensitive?
>
>  Doesn't matter, the parser lowercases everything anyway.
>
>
>  >       - Again there, shouldn't we be given unicode codepoints for that (as
>  > it'll be a unicode string)?
>
>  Not sure what you mean.
>
>
>  On Sat, 26 May 2007, Henri Sivonen wrote:
>  >
>  > The draft says:
>  > "A leading U+FEFF BYTE ORDER MARK (BOM) must be dropped if present."
>  >
>  > That's reasonable for UTF-8 when the encoding has been established by
>  > other means.
>  >
>  > However, when the encoding is UTF-16LE or UTF-16BE (i.e. supposed to be
>  > signatureless), do we really want to drop the BOM silently? Shouldn't it
>  > count as a character that is in error?
>
>  Do the UTF-16LE and UTF-16BE specs make a leading BOM an error?
>
>  If yes, then we don't have to say anything, it's already an error.
>
>  If not, what's the advantage of complaining about the BOM in this case?
>
>
>  > Likewise, if an encoding signature BOM has been discarded and the first
>  > logical character of the stream is another BOM, shouldn't that also
>  > count as a character that is in error?
>  >
>  > I think I should elaborate that when the encoding is UTF-16 (not
>  > UTF-16LE or UTF-16BE), the BOM gets swallowed by the character
>  > decoding layer (in reasonable decoder implementations) and is not
>  > returned from the character stream at all. Therefore, on the character
>  > level, a droppable BOM only occurs in UTF-8 when the encoding was
>  > established by other means.
>
>  The spec says: "Given an encoding, the bytes in the input stream must be
>  converted to Unicode characters for the tokeniser, as described by the
>  rules for that encoding, except that leading U+FEFF BYTE ORDER MARK
>  characters must not be stripped by the encoding layer."
>
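To make the BOM discussion above concrete, here is a small sketch of how
Python's standard codecs behave. It only illustrates decoder behaviour in one
library, not what the draft requires; drop_leading_bom is a hypothetical
helper standing in for the "must be dropped if present" step.

```python
import codecs

data_utf8 = codecs.BOM_UTF8 + "hi".encode("utf-8")
data_utf16le = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

# The plain UTF-8 decoder keeps the BOM as a leading U+FEFF character.
kept_utf8 = data_utf8.decode("utf-8")
# The "utf-8-sig" decoder treats the BOM as a signature and removes it.
stripped_utf8 = data_utf8.decode("utf-8-sig")
# The generic UTF-16 decoder uses the BOM to pick the byte order and
# does not return it.
swallowed_utf16 = data_utf16le.decode("utf-16")
# The explicit little-endian decoder returns the BOM as U+FEFF.
kept_utf16le = data_utf16le.decode("utf-16-le")


def drop_leading_bom(text: str) -> str:
    """Hypothetical helper: drop one leading U+FEFF, if present."""
    return text[1:] if text.startswith("\ufeff") else text
```

This matches the observation that, at the character level, a droppable BOM
shows up only when the decoder did not already consume it as a signature.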
>
>  On Mon, 28 May 2007, Henri Sivonen wrote:
>  >
>  > To this end, I think at least for conforming documents the algorithm for
>  > establishing the character encoding should be deterministic. I'd like to
>  > request two things:
>  >
>  > 1) When sniffing for meta charset, the current draft allows a user agent
>  > to give up sooner than after examining the first 512 bytes. To make meta
>  > charset sniffing reliable and deterministic so that it doesn't depend on
>  > flukes in buffering, I think UAs should (if there's no transfer protocol
>  > level charset label and no BOM) be required to consume bytes until they
>  > find a meta charset, reach the EOF or have examined 512 bytes. That is,
>  > I think UAs should not be allowed to give up earlier. (On the other
>  > hand, I think UAs should be allowed to start examining the byte stream
>  > before 512 have been buffered without an IO error, since in general,
>  > byte stream buffer management should be up to the IO libraries and
>  > outside the scope of the HTML spec.)
>
>  I don't want to do this because I don't want to require that browsers
>  handle a CGI script that outputs 500 bytes then hangs for a minute in a
>  way that doesn't render anything for a minute, and I don't want to require
>  that people writing such CGI scripts front-load a 512 byte comment.
>
>  We've already conceded that a page can document.write() an encoding
>  declaration after 6 megabytes of content and end up causing a reparse.
>
>
>  > 2) Since the chardet step is optional and the spec doesn't make the
>  > Mozilla chardet behavior normative, I think the document should be
>  > considered non-conforming if the algorithm for establishing the
>  > character encoding proceeds to steps 6 (chardet) or 7 (last resort
>  > default).
>
>  That would make most of my pages non-conforming. It would make this
>  non-conforming:
>
>    <!DOCTYPE HTML>
>    <html>
>     <head>
>      <title> Example </title>
>     </head>
>     <body>
>      <p> I don't want to be non-conforming! </p>
>     </body>
>    </html>
>
>
>  > It wouldn't hurt, though, to say in the section on writing documents that at
>  > least one of the following is required for document conformance:
>  >  * A transfer protocol-level character encoding declaration.
>  >  * A meta charset within the first 512 bytes.
>  >  * A BOM.
>
>  We already require that, though without the 512 byte requirement.
>
>
>  On Tue, 29 May 2007, Henri Sivonen wrote:
>  >
>  > To avoid stepping on the toes of Charmod more than is necessary, I
>  > suggest making it non-conforming for a document to have bytes in the
>  > 0x80…0x9F range when the character encoding is declared to be one of the
>  > ISO-8859 family encodings.
>
>  Done, I believe.
>
>
>  > (UA conformance requires in some cases these bytes to be decoded in a
>  > Charmod-violating way, but reality trumps Charmod for UA conformance.
>  > While I'm at it: Surely there are other ISO-8859 family encodings
>  > besides ISO-8859-1 that require decoding using the corresponding
>  > windows-* family decoder?)
>
>  Maybe; anyone have any concrete information?
>
>
>  On Tue, 29 May 2007, Maciej Stachowiak wrote:
>  >
>  > I don't know of any ISO-8859 encodings requiring this, but for all
>  > unicode encodings and numeric entity references compatibility requires
>  > interpreting this range of code points in the WinLatin1 way.
>
>  On Mon, 4 Jun 2007, Henri Sivonen wrote:
>  >
>  > I tested with Firefox 2.0.4, Minefield, Safari 2.0.4, WebKit nightly and
>  > Opera 9.20 (all on Mac). Only Safari 2.0.4 gives the DWIM treatment to
>  > the C1 code point range in UTF-8 and UTF-16.
>  >
>  > This makes me suspect that compatibility with the Web doesn't really
>  > require the DWIM treatment here. What does IE7 do?
>  >
>  > The data I used: http://hsivonen.iki.fi/test/utf-c1/
>
>  IE7 and Safari 3 do the same as the other browsers, namely, no DWIM
>  treatment.
>
>  So, I haven't changed the spec.
>
>
>  On Fri, 1 Jun 2007, Henri Sivonen wrote:
>  >
>  > The anomalies seem to be:
>  >  1) ISO-8859-1 is decoded as Windows-1252.
>  >  2) 0x85 in ISO-8859-10 and in ISO-8859-16 is decoded as in Windows-1252
>  > (ellipsis) by Gecko.
>  >  3) ISO-8859-11 is decoded as Windows-874.
>  >
>  > I was rather surprised by the results. They weren't at all what I expected.
>  > Test data: http://hsivonen.iki.fi/test/iso8859/
>  >
>  > I suggest adding the ISO-8859-11 to Windows-874 mapping to the spec.
>
>  On Fri, 1 Jun 2007, Henri Sivonen wrote:
>  >
>  > By Firefox and Opera. Safari doesn't support ISO-8859-11 and I was
>  > unable to test IE.
>
>  On Fri, 1 Jun 2007, Simon Pieters wrote:
>  >
>  > IE7 and Opera handle ISO-8859-11.htm the same, AFAICT.
>
>  I did some studies and there appear to be enough pages as ISO-8859-11 to
>  add this. I didn't check how many had bytes in the affected range, which
>  maybe would be worth checking, though.
>
>
>  On Sat, 2 Jun 2007, Øistein E. Andersen wrote:
>  >
>  > As suggested earlier [1], a simpler solution seems to be to treat C1
>  > bytes and NCRs from /all/ ISO-8859-* and Unicode encodings as
>  > Windows-1252.
>
>  That seems excessive.
>
>
>  On Tue, 5 Jun 2007, Henri Sivonen wrote:
>  > >
>  > > To avoid stepping on the toes of Charmod more than is necessary, I
>  > > suggest making it non-conforming for a document to have bytes in the
>  > > 0x80…0x9F range when the character encoding is declared to be one of
>  > > the ISO-8859 family encodings.
>  >
>  > I've been thinking about this. I have a proposal on how to spec this
>  > *conceptually* and how to implement this with error reporting. I am
>  > assuming here that 1) No one ever intends C1 code points to be present
>  > in the decoded stream and 2) we want, as a Charmod correctness fig leaf,
>  > to make the C1 bytes non-conforming when ISO-8859-1 or ISO-8859-11 was
>  > declared but Windows-1252 or Windows-874 decoding is needed.
>
>  I really don't care too much about the fig leaf part.
>
>
>  > Based on the behavior of Minefield and Opera 9.20, the following seems
>  > to be the least Charmod violating and least quirky approach that could
>  > possibly work:
>  >
>  > 1) Decode the byte stream using a decoder for whatever encoding was declared,
>  > even ISO-8859-1 or ISO-8859-11, according to
>  > ftp://ftp.unicode.org/Public/MAPPINGS/.
>  > 2) If a character in the decoded character stream is in the C1 code point
>  > range, this is a document conformance violation.
>  >    2a) If the declared encoding was ISO-8859-1, replace that character with
>  > the character that you get by casting the code point into a byte and decoding
>  > it as Windows-1252.
>  >    2b) If the declared encoding was ISO-8859-11, replace that character with
>  > the character that you get by casting the code point into a byte and decoding
>  > it as Windows-874.
>
>  That sounds far more complex than what we have now.
>
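For readers who want to see what the proposal in steps 1, 2, 2a and 2b would
amount to, here is a rough Python sketch. It is an illustration only:
decode_with_c1_repair and C1_FALLBACK are hypothetical names, Python's codec
tables stand in for the unicode.org mappings, and leaving a C1 character in
place when the fallback table has no mapping is an assumption.

```python
# Declared encoding -> Windows code page used to reinterpret C1 code points.
C1_FALLBACK = {"iso-8859-1": "cp1252", "iso-8859-11": "cp874"}


def decode_with_c1_repair(data: bytes, declared: str):
    """Return (text, violations) following the proposed steps 1, 2, 2a, 2b."""
    text = data.decode(declared)              # step 1: decode as declared
    fallback = C1_FALLBACK.get(declared.lower())
    out = []
    violations = 0
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:           # step 2: C1 code point found
            violations += 1                   # counts as a conformance violation
            if fallback is not None:          # steps 2a / 2b
                try:
                    ch = bytes([ord(ch)]).decode(fallback)
                except UnicodeDecodeError:
                    pass                      # assumption: keep the C1 character
        out.append(ch)
    return "".join(out), violations
```

Compared with simply aliasing ISO-8859-1 to Windows-1252, the extra pass
mainly buys error reporting, which is the trade-off being weighed above.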
>
>  On Tue, 5 Jun 2007, Kristof Zelechovski wrote:
>  >
>  >     2c) If the declared encoding was ISO-8859-2, replace that character
>  > with the character that you get by casting the code point into a byte
>  > and decoding it as Windows-1250.
>
>  On Tue, 5 Jun 2007, Henri Sivonen wrote:
>  >
>  > As far as I can tell, that's not what Firefox, Minefield, Opera 9.20 and
>  > WebKit nightlies do, so apparently it is not required for compatibility
>  > with a notable number of pages.
>
>  Indeed.
>
>
>  On Tue, 5 Jun 2007, Maciej Stachowiak wrote:
>  >
>  > What we actually do in WebKit is always use a windows-1252 decoder when
>  > ISO-8859-1 is requested. I don't think it's very helpful to make all
>  > documents that declare a ISO-8859-1 encoding and use characters in the
>  > C1 range nonconforming. It's true that they are counting on nonstandard
>  > processing of the nominally declared encoding, but I don't think that
>  > causes a problem in practice, as long as the rule is well known. It
>  > seems simpler to just make latin1 an alias for winlatin1.
>
>  I agree.
>
>
>  On Fri, 1 Jun 2007, Raphael Champeimont (Almacha) wrote:
>  >
>  > I think there is something wrong in the "get an attribute" algorithm
>  > from 8.2.2. The input stream.
>  >
>  > Between steps 11 and 12 I think there is a missing:
>  >
>  > 11b: Advance position to the next byte.
>  >
>  > With the current algorithm, if I write <meta charset = ascii> it will
>  > say the value of attribute charset is "aascii", with one leading "a" too
>  > many.
>  >
>  > The reason is that in step 11 if we fall in case "Anything else" we add
>  > the new char to the string, and then if we fall in "Anything else" in
>  > step 12 we add again the *same* char to the string, so the first char of
>  > the attribute value appears 2 times.
>
>  Fixed. (Though please check. I made several changes to this algorithm and
>  would be happier if I knew someone had proofread the changes!)
>
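The reported bug can be reproduced with a much simplified model of steps 11
and 12. This is a sketch under assumptions, not the spec's algorithm: the
function name, the flag, and the set of terminating bytes are all made up for
illustration.

```python
def read_unquoted_value(data: bytes, pos: int, advance_after_first: bool) -> str:
    """Simplified model of reading an unquoted attribute value."""
    value = bytearray()
    # "Step 11" (simplified): append the first byte of the value.
    value.append(data[pos])
    if advance_after_first:
        pos += 1  # the proposed step 11b: advance to the next byte
    # "Step 12" (simplified): append bytes until whitespace or '>'.
    while pos < len(data) and data[pos] not in b" \t\n\r>":
        value.append(data[pos])
        pos += 1
    return value.decode("ascii")
```

Calling read_unquoted_value(b"ascii>", 0, False) models the draft as reported
and yields the doubled first character; passing True models the fix.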
>
>  On Fri, 1 Jun 2007, Henri Sivonen wrote:
>  >
>  > In the charset meta sniffing algorithm under "Attribute name:":
>  >
>  > > If it is 0x2F (ASCII '/'), 0x3C (ASCII '<'), or 0x3E (ASCII '>')
>  > >     Stop looking for an attribute. The attribute's name is the value of
>  > > attribute name, its value is the empty string.
>  >
>  > In general, it seems to me the algorithm isn't quite clear on when to
>  > stop looking for the current attribute and when to stop looking for
>  > attributes for the current tag altogether.
>
>  The spec never distinguishes these two cases in the "get an attribute"
>  algorithm -- the algorithm that invokes the "get an attribute" algorithm
>  is the one that decides how often it is done.
>
>
>  > In this step, it seems to me that '/' should advance the pointer and end
>  > getting the current attribute followed by getting another attribute. '>'
>  > should end getting attributes on the whole tag without changing the
>  > pointer.
>
>  It doesn't matter. Both return an attribute, then the invoking algorithm
>  retries and if that results in no attribute (because you're on the ">")
>  then you stop looking for the tag.
>
>
>  On Fri, 1 Jun 2007, Henri Sivonen wrote:
>  >
>  > The spec probably needs to be made more specific about the case where
>  > the ASCII byte-based algorithm finds a supported encoding name but the
>  > encoding is not a rough ASCII superset.
>  >
>  > 23:46 < othermaciej> one quirk in Safari is that if there's a meta tag
>  >                      claiming the source is utf-16, we treat it as utf-8
>  > ...
>  > 23:48 < othermaciej> hsivonen: there is content that needs it
>  > ...
>  > 23:52 < othermaciej> hsivonen: I think we may treat any claimed unicode
>  >                      charset in a <meta> tag as utf-8
>
>  Oops, I had this for the case where utf-16 was detected on the fly, but
>  not for the preparser. Fixed.
>
>
>  On Sat, 2 Jun 2007, Philip Taylor wrote:
>  >
>  > 8.2.2. The input stream: "If the next six characters are not 'charset'"
>  > - s/six/seven/
>
>  Fixed.
>
>
>  On Thu, 14 Jun 2007, Henri Sivonen wrote:
>  >
>  > As written, the charset sniffing algorithm doesn't trim space characters
>  > from around the tentative encoding name. html5lib test cases expect the
>  > space characters to be trimmed.
>  >
>  > I suggest trimming space characters (or anything <= 0x20 depending on
>  > which approach is the right for compat).
>
>  Actually it seems browsers don't do any trimming here. I've added a
>  comment to that effect.
>
>
>  On Sat, 23 Jun 2007, Øistein E. Andersen wrote:
>  > >>
>  > >>> Bytes or sequences of bytes in the original byte stream that could
>  > >>> not be converted to Unicode characters must be converted to U+FFFD
>  > >>> REPLACEMENT CHARACTER code points.
>  > >>
>  > >> [This does not specify the exact number of replacement characters.]
>  > >
>  > > I don't really know how to define this.
>  >
>  > Unicode 5.0 remains vague on this point. (E.g., definition D92 defines
>  > well-formed and ill-formed UTF-8 byte sequences, but conformance
>  > requirement C10 only requires ill-formed sequences to be treated as an
>  > error condition and suggests that a one-byte ill-formed sequence may be
>  > either filtered out or replaced by a U+FFFD replacement character.) More
>  > generally, character encoding specifications can hardly be expected to
>  > define proper error handling, since they are usually not terribly
>  > preoccupied with mislabelled data.
>
>  They should define error handling, and are defective if they don't.
>  However, I agree that many specs are defective. This is certainly not
>  limited to character encoding specifications.
>
>
>  > The current text may nevertheless be too liberal. It would notably be
>  > possible to construct an arbitrarily long Chinese text in a legacy
>  > encoding which -- according to the spec -- could be replaced by one
>  > single U+FFFD replacement character if incorrectly handled as UTF-8,
>  > which might lead the user to think that the page is completely
>  > uninteresting and therefore move on, whereas a larger number of
>  > replacement characters would have led him to try another encoding. (This
>  > is only a problem, of course, if an implementor chooses to emit the
>  > minimal number of replacement characters sanctioned by the spec.)
>
>  Yes, but this is a user interface issue, not an interoperability issue, so
>  I don't think we need to be concerned about it.
>
>
>  On Thu, 2 Aug 2007, Henri Sivonen wrote:
>
>  > On Aug 2, 2007, at 10:11, Ian Hickson wrote:
>  >
>  > > Would a non-normative note help here? Something like:
>  > >
>  > >    Note: Bytes or sequences of bytes in the original byte stream that did
>  > >    not conform to the encoding specification (e.g. invalid UTF-8 byte
>  > >    sequences in a UTF-8 input stream) are errors that conformance
>  > >    checkers are expected to report.
>  > >
>  > > ...to be put after the paragraph that reads "Bytes or sequences of
>  > > bytes in the original byte stream that could not be converted to
>  > > Unicode characters must be converted to U+FFFD REPLACEMENT CHARACTER
>  > > code points".
>  >
>  > Yes, this is what I meant with "a note hinting the consequences".
>
>  Ok, added.
>
>
>  > > (Note that not all bytes or sequences of bytes in the original byte
>  > > stream that could not be converted to Unicode characters are
>  > > necessarily errors. It could just be that the encoding has a character
>  > > set that isn't a subset of Unicode, e.g. the Apple logo found in most
>  > > Apple character sets doesn't have a non-PUA analogue in Unicode. Its
>  > > presence in an HTML document isn't an error as far as I'm concerned.)
>  >
>  > Since XML and HTML5 are defined in terms of Unicode characters, there's
>  > nowhere to go except error and REPLACEMENT CHARACTER or the PUA for
>  > characters that aren't in Unicode. I'd steer clear of this in the spec
>  > and let decoders choose between de facto PUA assignments (like U+F8FF for
>  > the Apple logo) and errors.
>
>  Yeah, I don't have any intention of mentioning this in the spec.
>
>
>  On Wed, 31 Oct 2007, Martin Duerst wrote:
>  >
>  > [8.2.2.1]
>  >
>  > In point 3., it's not completely clear whether the encoding returned is
>  > e.g. "UTF-16BE BOM" or "UTF-16BE". Probably the best thing editorially
>  > is to move the word BOM from the description column of the table to the
>  > text prior to the table.
>
>  Fixed.
>
>
>  > In point 7, what I find unnecessary is the repeated mention of heuristic
>  > algorithms, which are already mentioned previously in point 6.
>
>  The heuristics in step 6 are for determining an encoding based on the byte
>  stream, e.g. using frequency analysis. The heuristics in step 7 are for
>  picking a default once that has failed. For example, if the defaults are
>  UTF-8 or Win1252, then you can determine which to pick by simply deciding
>  whether or not the stream is valid UTF-8.
>
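The step 7 example (choose between UTF-8 and Win1252 by checking whether the
stream is valid UTF-8) can be sketched as follows. This is only an
illustration; pick_default_encoding is a hypothetical name, and the sketch
assumes the whole buffer is available, since a buffer cut in the middle of a
multi-byte sequence would wrongly look invalid.

```python
def pick_default_encoding(data: bytes) -> str:
    """Pick a last-resort default: UTF-8 if the bytes are valid UTF-8."""
    try:
        data.decode("utf-8")      # strict decoding raises on invalid sequences
    except UnicodeDecodeError:
        return "windows-1252"
    return "utf-8"
```

Pure ASCII input is valid UTF-8, so it falls on the UTF-8 side here; both
candidate encodings decode such input identically anyway.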
>
>  > (I'm really interested what document [UNIVCHADET] is going to point to.)
>
>  http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
>
>  (It's in the source.)
>
>
>  > What I find missing/unclear is that the user can override the page
>  > encoding manually. What is mentioned is a user-specified default, which
>  > makes sense (e.g. "well, I'm mostly viewing Chinese pages, so I set my
>  > default to GB2312"). However, what we also need is the possibility for a
>  > user to override the encoding of a specific page (not changing the
>  > default). This is necessary because some pages are still mislabeled.
>  > When such an override is present, it should come before what's currently
>  > number 1.
>
>  User agents can provide user interfaces to override anything they want,
>  e.g. they could provide an interface that changes all <script> elements
>  into <pre> elements on the fly, or whatever. Such behaviour is outside the
>  scope of the specification, since it is no longer about interoperability,
>  but about user control. It's technically non-compliant, because it is
>  doing something with the page that doesn't match what would happen for
>  other people (unless they _also_ overrode the spec behaviour).
>
>
>  > In 8.2.2.2, what I find unnecessary is that encodings such as UTF-7 are
>  > explicitly forbidden. I agree that these are virtually useless. However,
>  > I don't think implementing them would create any harm, and I don't think
>  > they should be dignified by even mentioning them.
>
>  Sadly they do cause harm. The ones that are outlawed have all been used in
>  either actual attacks or proof-of-concept attacks described in
>  vulnerability reports, mostly due to their deceptive similarity to more
>  common encodings. (UTF-7 in particular has been used in a number of
>  attacks, because IE supported auto-detecting it, if I recall correctly.)
>
>
>  > In 8.2.2.4, I have no idea what's the reason or purpose of point 1,
>  > which reads "If the new encoding is UTF-16, change it to UTF-8.". I
>  > suspect some misunderstanding.
>
>  This is required because many pages are labelled as UTF-16 but actually
>  use UTF-8. For example:
>
>   http://www.zingermans.com
>
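A minimal sketch of the rule quoted above ("If the new encoding is UTF-16,
change it to UTF-8"), for illustration only. The function name is
hypothetical, and comparing names case-insensitively is an assumption of
this sketch.

```python
def adjust_declared_encoding(name: str) -> str:
    """Apply the 'UTF-16 becomes UTF-8' rule to an encoding found in <meta>."""
    # A <meta> declaration that was readable byte by byte as ASCII cannot
    # itself have been UTF-16 encoded, so the label is taken to be wrong.
    if name.lower() == "utf-16":
        return "utf-8"
    return name
```

Other labels pass through unchanged, so the rule only affects pages that
claim UTF-16 from inside markup that was parsed as ASCII-compatible bytes.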
>
>  > Well, now let's get back to CharMod, and to the place where I think you
>  > need to do more work. HTML5 currently says "treat data labeled
>  > iso-8859-1 as windows-1252". This conflicts with C025 of CharMod
>  > (http://www.w3.org/TR/charmod/#C025):
>  >
>  > C025 [I] [C] An IANA-registered charset name MUST NOT be used to label
>  > text data in a character encoding other than the one identified in the
>  > IANA registration of that name.
>  >
>  > and also C030 (http://www.w3.org/TR/charmod/#C030): C030 [I] When an
>  > IANA-registered charset name is recognized, receiving software MUST
>  > interpret the received data according to the encoding associated with
>  > the name in the IANA registry.
>  >
>  > So the following sentence:
>  >
>  > "When a user agent would otherwise use the ISO-8859-1 encoding, it must
>  > instead use the Windows-1252 encoding."
>  >
>  > from HTML5 is clearly not conforming to CharMod.
>
>  Indeed, it says so explicitly in the spec.
>
>
>  > Please note that the above items (C025 and C030) say that they only
>  > affect implementations ([I]) and content ([C]), but I think the main
>  > reason for this is that we never even imagined that a spec would say
>  > "you must treat FOO as BAR".
>  >
>  > I don't disagree with 'widely deployed', but I think one main reason for
>  > this is that it took ages to get windows-1252 registered. I think there
>  > are other ways to deal with this issue than a MUST. One thing that I
>  > guess you could do is to just describe current practice.
>
>  Well, what we're describing is what an implementation has to do to be
>  compatible with the other implementations. And right now, this is one of
>  the things it has to do.
>
>
>  > This brings me to another point: The whole HTML5 spec seems to be
>  > written with implementers, and implementers only, in mind. This is great
>  > to help get browser behavior aligned, but it creates an enormous
>  > problem: The majority of potential users of the spec, namely creators of
>  > content, and of tools creating content, are completely left out. As an
>  > example, trying to reverse-engineer how to indicate the character
>  > encoding inside an HTML5 document from point 4 in 8.2.2.1 is completely
>  > impossible for content creators, webmasters, and the like.
>
>  Section "8.2 Parsing HTML documents" is indeed exclusively for user agent
>  implementors and conformance checker implementors. For authors and
>  authoring tool implementors, you want section "8.1 Writing HTML documents"
>  and section "3.7.5.4. Specifying the document's character encoding" (which
>  is linked to from 8.1). These give the flipside of these requirements, the
>  authoring side.
>
>
>  On Sat, 3 Nov 2007, Addison Phillips wrote:
>  >
>  > --
>  > Otherwise, return an implementation-defined or user-specified default
>  > character encoding, with the confidence tentative. Due to its use in
>  > legacy content, windows-1252 is recommended as a default in
>  > predominantly Western demographics. In non-legacy environments, the more
>  > comprehensive UTF-8 encoding is recommended instead. Since these
>  > encodings can in many cases be distinguished by inspection, a user agent
>  > may heuristically decide which to use as a default.
>  > --
>  >
>  > Our comment is that this is a pretty weak recommendation. It is
>  > difficult to say what a "Western demographic" means in this context. We
>  > think we know why this is here: untagged HTML4 documents have a default
>  > character encoding of ISO 8859-1, so it is unsurprising to assume its
>  > common superset encoding when no other encoding can be guessed.
>  >
>  > However, we would like to see several things happen here:
>  >
>  > 1. It never actually says anywhere why windows-1252 must be used instead
>  > of ISO 8859-1.
>
>  This is required in "Preprocessing the input stream".
>
>
>  > 2. As quoted, it seems to (but does not actually) favor 1252 over UTF-8.
>  > Since UTF-8 is highly detectable and also the best long-term general
>  > default, we'd prefer if the emphasis were reversed, dropping the
>  > reference to "Western demographics". For example:
>  >
>  > --
>  > Otherwise, return an implementation-defined or user-specified default
>  > character encoding, with the confidence tentative. UTF-8 is recommended
>  > as a default encoding in most cases. Due to its use in legacy content,
>  > windows-1252 is also recommended as a default. Since these encodings can
>  > usually be distinguished by inspection, a user agent may heuristically
>  > decide which to use as a default.
>  > --
>
>  I've reversed the order, though not removed the mention of the Western
>  demographic, which I think is actually quite accurate and generally more
>  understandable than, say, occidental. I would like to know what the more
>  common codecs are in oriental demographics, though, to broaden the use of
>  the recommendations.
>
>
>  > 3. Possibly something should be said (elsewhere, not in this paragraph)
>  > about using other "superset" encodings in preference to the explicitly
>  > named encoding (that is, other encodings bear the same relationship as
>  > windows-1252 does to iso8859-1 and user-agents actually use these
>  > encodings to interpret pages and/or encode data in forms, etc.)
>
>  Is the current (new) text sufficient in this regard? See also the earlier
>  comments for details on the decisions behind the new text.
>
>
>  On Thu, 6 Dec 2007, Sam Ruby wrote:
>  > Ian Hickson wrote:
>  > > On Wed, 5 Dec 2007, Sam Ruby wrote:
>  > > > Henri Sivonen wrote:
>  > > > > I identified four classes of errors:
>  > > > >  1) meta charset in XHTML
>  > > > Why specifying a charset that matches the encoding is flagged as an
>  > > > error is probably something that should be discussed another day.
>  > > > I happen to believe that people will author content intended to be
>  > > > used by multiple user agents which are at various levels of spec
>  > > > conformance.
>  > >
>  > > That's actually an XML issue -- XML says the encoding should be in the
>  > > XML declaration, so HTML tries to not step on its toes and says that
>  > > the charset declaration shouldn't be included in the markup. (The spec
>  > > has to say that the UA must ignore that line anyway, so it's not clear
>  > > that there's any benefit to including it.)
>  >
>  > If the declaration clashed, I could see the value in an error message,
>  > but as I said, this can be discussed another day.
>
>  Is it another day yet? :-)
>
>
>  On Fri, 25 Jan 2008, Frank Ellermann wrote:
>  >
>  > Hi, the chapter about "acceptable" charsets (8.2.2.2) is messy. Clearly
>  > UTF-8 and windows-1252 are popular, and you have that.
>  >
>  > What you need as a "minimum" for new browsers is UTF-8, US-ASCII (as
>  > popular proper subset of UTF-8), ISO-8859-1 (as HTML legacy), and
>  > windows-1252 for the reasons stated in the draft, supporting Latin-1 but
>  > not windows-1252 would be stupid.
>
>  Right, that's what the draft currently requires.
>
>
>  > BTW, I'm not aware that windows-1252 is a violation of CHARMOD, I asked
>  > a question about it and C049 in a Last Call of CHARMOD.
>
>  See one of the earlier e-mails in this compound reply for the reasoning.
>
>
>  > Please s/but may support more/but should support more/ - the minimum is
>  > only that, the minimum.
>
>  "SHOULD" has very strong connotations that I do not think apply here. In
>  particular, it makes no sense to have an open-ended SHOULD in this
>  context.
>
>
>  > | User agents must not support the CESU-8, UTF-7, BOCU-1 and SCSU
>  > | encodings
>  >
>  > I can see a MUST NOT for UTF-7 and CESU-8.  And IMO the only good excuse
>  > for legacy charsets is backwards compatibility.  But that is at worst a
>  > "SHOULD NOT" for BOCU-1, as you have it for UTF-32.
>  >
>  > I refuse to discuss SCSU, but MUST NOT is rather harsh, isn't it ?
>
>  As noted earlier, these requirements are derived from real or potential
>  security vulnerabilities.
>
>
>  > In 3.7.5.4 you say:
>  >
>  > | Authors should not use JIS_X0212-1990, x-JIS0208, and encodings
>  > | based on EBCDIC.  Authors should not use UTF-32.
>  >
>  > What's the logic behind these recommendations ?  Of course EBCDIC
>  > is rare (as far as HTML is concerned I've never seen it), but it's
>  > AFAIK not worse than codepage 437, 850, 858, or similar charsets.
>
>  Those are non-US-ASCII-compatible encodings. For further reasoning see the
>  thread that resulted in:
>
>    http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2007-June/011949.html
>
>
>  > And UTF-32 is relatively harmless, not much worse than UTF-16, it
>  > belongs to the charsets recommended in CHARMOD.  Depending on what
>  > happens in future Unicode versions banning UTF-32 could backfire.
>
>  Actually UTF-32 is quite harmful, due to its extra cost in implementation,
>  its very limited testing, and the resulting bugs in almost all known
>  implementations.
>
>
>  > There are lots of other charsets starting with UTF-1 that could be
>  > listed as SHOULD NOT or even MUST NOT.  Whatever you pick, state what
>  > your reasons are, not only the (apparently) arbitrary result.
>
>  The reasons are sometimes rather involved or subtle, and I'd rather not
>  have the specification defend itself. It's a spec, not a position paper. :-)
>
>
>  > Please make sure that all *unregistered* charsets are SHOULD NOT. Yes, I
>  > know the consequences for some proprietary charsets, they are free to
>  > register them or to be ignored (CHARMOD C022).
>
>  It's already a must ("The value must be a valid character encoding name,
>  and must be the preferred name for that encoding.").
>
>
>  On Tue, 29 Jan 2008, Brian Smith wrote:
>  > Henri Sivonen wrote:
>  > > My understanding is that HTML 5 bans these post-UTF-8
>  > > second-system Unicode encodings no matter where you might
>  > > declare the use.
>  >
>  > It is in section 3.7.5 (the META element), and not in section 8 (The
>  > HTML Syntax), and the reference to section 3.7.5 in section 8 says that
>  > the restrictions apply (only) in a (<META>) character encoding
>  > declaration. So, it seems the real issue is just clarifying the text in
>  > 3.7.5.4 to indicate that those restrictions apply only when the META
>  > charset override mechanism is being used.
>
>  I don't understand.
>
>
>  > > The purpose of the HTML 5 spec is to improve interoperability between
>  > > Web browsers as used with content and Web apps published on the one
>  > > public Web. The normative language in the spec is concerned with
>  > > publishing and consuming content and apps on the Web. The purpose of
>  > > the spec isn't to lower the R&D cost of private and proprietary
>  > > systems by producing reusable bits.
>  >
>  > Then why doesn't the specification list the encodings that conformant
>  > web browsers are required to support, instead of listing the encodings
>  > that document authors are forbidden from using.
>
>  Because the former list is open-ended, whereas the latter list is not,
>  and the latter list is more important.
>
>
>  > > > Even after Unicode and the UTF encodings, new encodings are still
>  > > > being created.
>  > >
>  > > Deploying such encodings on the public network is a colossally bad
>  > > idea. (My own nation has engaged in this folly with ISO-8859-15, so
>  > > I've seen the bad consequences at home, too.)
>  >
>  > That is exactly my point. If the intention is that BOCU-1 should be
>  > prohibited, then shouldn't ISO-8859-15 be prohibited for the same
>  > reason? Why one and not the other?
>
>  One is used. The other is not. It really is that simple. We can stop the
>  madness for one of them, but it's too late for the other.
>
>
>  > Anyway, I am pretty sure that the restriction against BOCU and similar
>  > encodings is just to make it possible to correctly parse the <META>
>  > charset override, not to prevent their use altogether. The language just
>  > needs to be made clearer.
>
>  As the spec says, "authors must not use the CESU-8, UTF-7, BOCU-1 and SCSU
>  encodings". There's no limitation to <meta> or anything. They are just
>  banned outright.
>
>
>  On Thu, 31 Jan 2008, Henri Sivonen wrote:
>  >
>  > I ran an analysis on recent error messages from Validator.nu.
>  > http://hsivonen.iki.fi/test/moz/analysis.txt
>
>  Looking at this from the point of view of encodings, I see the following
>  common errors:
>
>   * <meta charset> not being at the top of <head>
>   * missing explicit character encoding declaration
>   * <meta content=""> not starting with text/html
>   * unpreferred encoding names
>
>  I think all of these are real errors, and I don't think we should change
>  the spec's encoding rules based on this data.
>
>  Thanks for this data. Basing spec development on real data like this is of
>  huge value.
>
>
>  On Thu, 31 Jan 2008, Sam Ruby wrote:
>  > >
>  > > I think we should allow the old internal encoding declaration syntax
>  > > for text/html as an alternative to the more elegant syntax. Not
>  > > declaring the encoding is bad, so we shouldn't send a negative message
>  > > to the authors who are declaring the encoding. Moreover, this is
>  > > interoperable stuff.
>  > >
>  > > I think we shouldn't allow this for application/xhtml+xml, though,
>  > > because authors might think it has an effect.
>  >
>  > By that reasoning, a meta charset encoding declaration should not be
>  > allowed if a charset is specified on the Content-Type HTTP header.  I
>  > ran into that very problem today:
>  >
>  > http://lists.planetplanet.org/archives/devel/2008-January/001747.html
>  >
>  > This content was XHTML, but was served as text/html, with a charset
>  > specified on the HTTP header, which overrode the charset on the meta
>  > declaration.
>
>  If they don't match, then there's an error (forcibly so, since one of the
>  two encodings has to be wrong!).
>
>
>  > Serving XHTML as text/html, with BOTH a charset specified on the HTTP
>  > header AND a meta charset specified just in case is more common than you
>  > might think.
>
>  It's not a recommended behaviour, though. Just pick one and use it. The
>  practice of making documents schizophrenic like this is a side-effect of
>  the market not fully supporting XHTML (i.e. IE). If it wasn't for that,
>  people wouldn't be as determined to give their documents identity crises.
>
>
>  > A much more useful restriction -- spanning both the HTML5 and XHTML5
>  > serializations -- would be to issue an error if multiple sources for
>  > encoding information were explicitly specified and if they differ.
>
>  That's already required.
>
>
>  On Mon, 11 Feb 2008, Henri Sivonen wrote:
>  > >
>  > > A much more useful restriction -- spanning both the HTML5 and XHTML5
>  > > serializations -- would be to issue an error if multiple sources for
>  > > encoding information were explicitly specified and if they differ.
>  >
>  > I agree. I had already implemented this as a warning on the XML side.
>  > (Not as an error because I'm not aware of any spec that I could justify
>  > for calling it an error.)
>
>  If the declarations disagree, one of them is wrong. It's an error for the
>  declaration to be wrong.
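
A conformance checker could test for disagreeing declarations roughly as
follows. This is a sketch under assumptions: the function names are
invented, and Python's codec alias table stands in for the registry of
encoding names, which it does not match exactly.

```python
import codecs
from typing import Optional


def normalize(label: str) -> Optional[str]:
    """Map an encoding label to Python's canonical codec name, if known."""
    try:
        return codecs.lookup(label.strip()).name
    except LookupError:
        return None


def declarations_conflict(http_charset: Optional[str],
                          meta_charset: Optional[str]) -> bool:
    """Return True when both sources declare an encoding and they differ."""
    if http_charset is None or meta_charset is None:
        # Only one source (or none) declared an encoding: nothing to compare.
        return False
    a, b = normalize(http_charset), normalize(meta_charset)
    if a is None or b is None:
        # Unknown label: fall back to a case-insensitive string comparison.
        return http_charset.strip().lower() != meta_charset.strip().lower()
    return a != b
```

Aliases such as "UTF-8" and "utf8" compare equal, so only a genuine
disagreement between the header and the meta element gets reported.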
>
>
>  > While I was at it, I noticed that the spec (as well as Gecko) don't
>  > require http-equiv='content-type' when looking for a content attribute
>  > that looks like an internal encoding declaration. Therefore, I also
>  > added a warning that fires if the value of a content attribute would be
>  > sniffed as an internal character encoding declaration but a
>  > http-equiv='content-type' is missing.
>
>  It's an error according to the spec.
>
>
>  On Fri, 1 Feb 2008, Henri Sivonen wrote:
>  >
>  > But surely the value for content should be ASCII-case-insensitive.
>
>  Ok.
>
>
>  > Also, why limit the space to one U+0020 instead of zero or more space
>  > characters?
>
>  Ok, allowed any number of space characters (and any space characters).
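
The two relaxations agreed above (ASCII-case-insensitive matching and any
number of space characters after the semicolon) might look like this in a
checker. The pattern is an assumption based on this thread, not the spec's
exact grammar, and the function name is hypothetical.

```python
import re
from typing import Optional

# re.ASCII keeps IGNORECASE from folding non-ASCII letters (for example
# U+017F LONG S matching "s"), so the match is ASCII-case-insensitive only.
# The character class lists the HTML space characters: space, tab, LF, FF, CR.
_CONTENT_RE = re.compile(
    r"text/html;[ \t\n\f\r]*charset=(\S+)",
    re.IGNORECASE | re.ASCII,
)


def charset_from_content(value: str) -> Optional[str]:
    """Return the declared charset from a meta content value, or None."""
    match = _CONTENT_RE.fullmatch(value)
    return match.group(1) if match else None
```

Values that do not start with text/html yield None, which corresponds to
the "<meta content> not starting with text/html" error listed earlier.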
>
>  --
>  Ian Hickson               U+1047E                )\._.,--....,'``.    fL
>  http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
>  Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'



-- 
Mark
Received on Friday, 29 February 2008 01:43:10 UTC