- From: Ian Hickson <ian@hixie.ch>
- Date: Sat, 23 Jun 2007 09:35:51 +0000 (UTC)
On Sat, 11 Mar 2006, Henri Sivonen wrote:
>
> I think allowing in-place decoder change (when feasible) would be good
> for performance.

Done.

> > > I think it would be beneficial to additionally stipulate that
> > >
> > > 1. The meta element-based character encoding information declaration
> > > is expected to work only if the Basic Latin range of characters maps
> > > to the same bytes as in the US-ASCII encoding.
> >
> > Is this realistic? I'm not really familiar enough with character
> > encodings to say if this is what happens in general.
>
> I suppose it is realistic. See below.

That was already there, turns out.

> > > 2. If there is no external character encoding information nor a BOM
> > > (see below), there MUST NOT be any non-ASCII bytes in the document
> > > byte stream before the end of the meta element that declares the
> > > character encoding. (In practice this would ban unescaped non-ASCII
> > > class names on the html and [head] elements and non-ASCII comments
> > > at the beginning of the document.)
> >
> > Again, can we realistically require this? I need to do some studies
> > of non-Latin pages, I guess.
>
> As UA behavior, no. As a conformance requirement, maybe.

I don't think we should require this, given the preparse step. I can if
people think we should, though.

> > > > Authors should avoid including inline character encoding
> > > > information. Character encoding information should instead be
> > > > included at the transport level (e.g. using the HTTP Content-Type
> > > > header).
> > >
> > > I disagree.
> > >
> > > With HTML with contemporary UAs, there is no real harm in including
> > > the character encoding information both on the HTTP level and in
> > > the meta as long as the information is not contradictory. On the
> > > contrary, the author-provided internal information is actually
> > > useful when end users save pages to disk using UAs that do not
> > > reserialize with internal character encoding information.
> >
> > ...and it breaks everything when you have a transcoding proxy, or
> > similar.
>
> Well, not until you save to disk, since HTTP takes precedence. However,
> authors can escape this by using UTF-8. (Assuming here that tampering
> with UTF-8 would be harmful, wrong and pointless.)
>
> Interestingly, transcoding proxies tend to be brought up by residents
> of Western Europe, North America or the Commonwealth. I have never seen
> a Russian person living in Russia or a Japanese person living in Japan
> talk about transcoding proxies in any online or offline discussion.
> That's why I doubt the importance of transcoding proxies.

I think this discouragement has been removed now. Let me know if it
lives on somewhere.

> > Character encoding information shouldn't be duplicated, IMHO, that's
> > just asking for trouble.
>
> I suggest a mismatch be considered an easy parse error and, therefore,
> reportable.

I believe this is required in the spec.

> > > > For HTML, user agents must use the following algorithm in
> > > > determining the character encoding of a document:
> > > > 1. If the transport layer specifies an encoding, use that.
> > >
> > > Shouldn't there be a BOM-sniffing step here? (UTF-16 and UTF-8
> > > only; UTF-32 makes no practical sense for interchange on the Web.)
> >
> > I don't know, should there?
>
> I believe there should.

There's a BOM step in the spec; let me know if you think it's in the
wrong place.
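For illustration only, a minimal sketch of what such a BOM-sniffing step
amounts to, deliberately limited to UTF-8 and UTF-16 as Henri suggests
(this is just the idea, not the spec's algorithm):

    def sniff_bom(data):
        """Sketch: look at the first bytes of the stream for a BOM and
        return the implied encoding and the number of bytes to skip, or
        None if there is no recognizable BOM."""
        if data.startswith(b"\xef\xbb\xbf"):
            return "UTF-8", 3
        if data.startswith(b"\xfe\xff"):
            return "UTF-16BE", 2
        if data.startswith(b"\xff\xfe"):
            return "UTF-16LE", 2
        return None

A UA would skip the reported number of bytes before handing the rest of
the stream to the decoder; the open question in this thread is only where
such a step sits relative to the transport-layer information.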
> > > > 2. Otherwise, if the user agent can find a meta element that
> > > > specifies character encoding information (as described above),
> > > > then use that.
> > >
> > > If a conformance checker has not determined the character encoding
> > > by now, what should it do? Should it report the document as
> > > non-conforming (my preferred choice)? Should it default to US-ASCII
> > > and report any non-ASCII bytes as conformance errors? Should it
> > > continue to the fuzzier steps like browsers would (hopefully not)?
> >
> > Again, I don't know.
>
> I'll continue to treat such documents as non-conforming, then.

I've made it non-conforming to not use ASCII if you've got no encoding
information and no BOM.

> Notably, character encodings that I am aware of and that [aren't
> ASCII-compatible] are:
>
> JIS_X0212-1990, x-JIS0208, various legacy IBM codepages, x-MacDingbat
> and x-MacSymbol, UTF-7, UTF-16 and UTF-32.
>
> The x-MacDingbat and x-MacSymbol encodings are irrelevant to Web pages.
> After browsing the encoding menus of Firefox, Opera and Safari, I'm
> pretty confident that the legacy IBM codepages are irrelevant as well.
>
> I suggest the following algorithm as a starting point. It does not
> handle UTF-7, CESU-8, JIS_X0212-1990 or x-JIS0208.

I've made those either MUST NOTs or SHOULD NOTs, amongst others.

> Set the REWIND flag to unraised.

The REWIND idea sadly doesn't work very well given that you can actually
have things like javascript: URIs and event handlers that execute on
content in the <head>, in pathological cases. However, I did something
similar in the spec as it stands now.

> Requirements I'd like to see:
>
> Documents must specify a character encoding and must use an
> IANA-registered encoding and must identify it using its preferred MIME
> name or use a BOM (with UTF-8, UTF-16 or UTF-32). UAs must recognize
> the preferred MIME name of every encoding they support that has a
> preferred MIME name. UAs should recognize IANA-registered aliases.

Done.

> Documents must not use UTF-EBCDIC, BOCU-1, CESU-8, UTF-7, UTF-16BE
> (i.e. BOMless), UTF-16LE, UTF-32BE, UTF-32LE or any encodings from the
> EBCDIC family of encodings. Documents using the UTF-16 or UTF-32
> encodings must have a BOM.

Done except for UTF-16BE and UTF-16LE, though you might want to check
that the spec says exactly what you want.

> UAs must support the UTF-8 encoding.

Done.

> Encoding errors are easy parse errors. (Emit U+FFFD on bogus data.)

Encoding errors are covered by the encoding specs.

> Authors are advised to use the UTF-8 encoding. Authors are advised not
> to use the UTF-32 encoding or legacy encodings. (Note: I think UTF-32
> on the Web is harmful and utterly pointless, but Firefox and Opera
> support it. Also, I'd like to have some text in the spec that justifies
> whining about legacy encodings. On the XML side, I give warnings if the
> encoding is not UTF-8, UTF-16, US-ASCII or ISO-8859-1. I also warn
> about aliases and potential trouble with RFC 3023 rules. However, I
> have no spec backing for treating dangerous RFC 3023 stuff as errors.)

Done, except about the RFC 3023 stuff. Could you elaborate on that? I
don't really have anything about encodings and XML in the spec.

> Also, the spec should probably give guidance on what encodings need to
> be supported. That set should include at least UTF-8, US-ASCII,
> ISO-8859-1 and Windows-1252. It should probably not be larger than the
> intersection of the sets of encodings supported by Firefox, Opera,
> Safari and IE6. (It might even be useful to intersect that set with the
> encodings supported by JDK and Python by default.)

Made it just UTF-8 and Win1252.
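As a rough illustration of the label handling being asked for, reduced
to a sketch: the table here is hypothetical and deliberately tiny,
standing in for the full set of IANA-registered names and aliases a real
UA would recognize.

    # Hypothetical, tiny label table for illustration only; a real UA
    # would recognize the full set of IANA-registered names and aliases.
    CANONICAL_LABEL = {
        "utf-8": "UTF-8",
        "utf8": "UTF-8",
        "windows-1252": "windows-1252",
        "cp1252": "windows-1252",
    }

    # The bare-minimum support set mentioned above.
    REQUIRED_ENCODINGS = {"UTF-8", "windows-1252"}

    def canonicalize(label):
        """Map an author-supplied encoding label to a canonical name,
        or None if the label isn't recognized."""
        return CANONICAL_LABEL.get(label.strip().lower())

A conformance checker could then flag any label that doesn't
canonicalize, or that isn't the encoding's preferred MIME name.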
On Sat, 11 Mar 2006, Henri Sivonen wrote:
> On Mar 11, 2006, at 17:10, Henri Sivonen wrote:
>
> > Where performing implementation-specific heuristics is called for,
> > the UA may analyze the byte spectrum using statistical methods.
> > However, at minimum the UA must fall back on a user-chosen encoding
> > that is a rough ASCII superset. This user choice should default to
> > Windows-1252.
>
> This will violate Charmod, but what can you do?

Indeed. (The HTML5 spec says the above.)

On Sun, 12 Mar 2006, Henri Sivonen wrote:
>
> On further reflection, it occurred to me that emitting the Windows-1252
> characters instead of U+FFFD would be a good optimization for the
> common case where the encoding later turns out to be Windows-1252 or
> ISO-8859-1. This would require more than one bookkeeping flag, though.

Required, always.

On Sun, 12 Mar 2006, Henri Sivonen wrote:
>
> For ISO-8859-* family encodings that have a corresponding Windows-*
> family superset (e.g. Windows-1252 for ISO-8859-1) the UA must use the
> Windows-* family superset decoder instead of the ISO-8859-* family
> decoder. However, any bytes in the 0x80–0x9F range (inclusive) are easy
> parse errors.

That isn't what the spec says, but I have other outstanding comments on
this to deal with still.

> I would like the spec to say that if the page has forms, using an
> encoding other than UTF-8 is trouble. And even for pages that don't
> have forms, using an encoding that is not known to be extremely well
> supported introduces incompatibility for no good reason.

Does the current text (which doesn't mention forms) satisfy you?

On Tue, 14 Mar 2006, Lachlan Hunt wrote:
>
> This will need to handle common mistakes such as the following:
>
> <meta ... content="application/xhtml+xml;charset=X">
> <meta ... content="foo/bar;charset=X">
> <meta ... content="foo/bar;charset='X'">
> <meta ... content="charset=X">
> <meta ... charset="X">

The ones that matter are now in the spec, as far as I can tell.

On Tue, 14 Mar 2006, Peter Karlsson wrote:
>
> > Documents must not use UTF-EBCDIC, BOCU-1, CESU-8, UTF-7, UTF-16BE
> > (i.e. BOMless), UTF-16LE, UTF-32BE, UTF-32LE or any encodings from
> > the EBCDIC family of encodings. Documents using the UTF-16 or UTF-32
> > encodings must have a BOM.
>
> I don't think forbidding BOCU-1 is a good idea. If there is ever a
> proper specification written of it, it could be very useful as a
> compression format for documents.

BOCU-1 has been used for security attacks. It's on the "no fly" list.

On Tue, 15 May 2007, Michael Day wrote:
>
> Suggestion: drop UTF-32 from the character encoding detection section
> of HTML5, and even better, discourage or forbid user agents from
> implementing support for UTF-32.

Done.

On Wed, 16 May 2007, Geoffrey Sneddon wrote:
>
> Including it in a few encoding detection algorithms is no big deal for
> us implementers: as the spec stands we aren't required to support it
> anyway. All the spec requires is that we include it within our encoding
> detections (so, if we don't support it, we can then reject it).

Right now it's not even being detected by the spec.

On Mon, 4 Jun 2007, Henri Sivonen wrote:
>
> What's the right thing for an implementation to do when UTF-32 is not
> supported? Decode as Windows-1252? Does that make sense?

That's basically what the spec requires now.
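For illustration, tolerating the sloppy content="" declarations Lachlan
lists above could be as simple as the following hypothetical helper (not
the spec's actual prescan algorithm; the bare charset="" attribute case
is read straight off the attribute instead):

    import re

    # Hypothetical helper: pull a charset label out of a <meta> content
    # attribute value, tolerating the sloppy forms listed above.
    _CHARSET_RE = re.compile(r"""charset\s*=\s*["']?\s*([^"'\s;]+)""",
                             re.IGNORECASE)

    def charset_from_content(value):
        """Return the charset label from values like
        "text/html; charset=utf-8", "charset=X" or "foo/bar;charset='X'",
        or None if no charset is declared."""
        match = _CHARSET_RE.search(value)
        return match.group(1) if match else None

Whatever this returns would still go through the usual label handling
and, as noted below, be ignored if the UA doesn't support it.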
On Mon, 4 Jun 2007, Alexey Feldgendler wrote:
>
> Seems like a general question: what's the right thing to do when the
> document's encoding is not supported? There isn't a reasonable fallback
> for every encoding.

The spec right now requires UAs to ignore <meta charset=""> declarations
they don't understand.

On Mon, 4 Jun 2007, Henri Sivonen wrote:
>
> I think it is perfectly reasonable to make support for UTF-8 and
> Windows-1252 part of UA conformance requirements. After all, a piece of
> software that doesn't support those two really has no business
> pretending to be a UA for the World Wide Web. Not supporting
> Windows-1252 based on "local market" arguments is serious
> walled-gardenism.

Indeed.

On Mon, 4 Jun 2007, Alexey Feldgendler wrote:
>
> On the other hand, declaring Windows-1252 as the default encoding is
> monoculturalism. For example, in Russia, whenever Windows-1252 is
> chosen, it is definitely a wrong choice. It's never used in Russia
> because it doesn't contain Cyrillic letters. A default of Windows-1251
> or KOI8-R might be reasonable in Russia, though neither of them is a
> 100% safe guess.

The spec allows any guess.

On Sun, 27 May 2007, Henri Sivonen wrote:
>
> "If the encoding is one of UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, or
> UTF-32LE, then authors can use a BOM at the start of the file to
> indicate the character encoding."
>
> That sentence should read:

That sentence is now gone. The "writing HTML" section generically allows
leading BOMs regardless of character encoding.

> The encoding labels with LE or BE in them mean BOMless variants where
> the encoding label on the transfer protocol level gives the endianness
> (see http://www.ietf.org/rfc/rfc2781.txt). When the spec refers to
> UTF-16 with a BOM in a particular endianness, I think the spec should
> use "big-endian UTF-16" and "little-endian UTF-16".
>
> Since declaring endianness on the transfer protocol level has no
> benefit over using the BOM when the label is right and there's a chance
> to get the label wrong, the encoding labels with explicit endianness
> are harmful for interchange. In my opinion, the spec should avoid
> giving authors any bad ideas by reinforcing these labels by repetition.

If you know the encoding before going in (e.g. it's in the Content-Type
metadata), then if the BOM is correctly encoded you just ignore it, and
if it's incorrectly encoded then you won't see it as a BOM and you'll
probably treat it as U+FFFD. From an authoring standpoint, especially
given that tools now tend to output BOMs silently (e.g. Notepad), and
*especially* considering that a BOM is invisible, it would just be a pain
to have to take out the first character in certain cases. No?
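A sketch of that behaviour, assuming the encoding really is known up
front from the Content-Type metadata (illustration only):

    def decode_with_known_encoding(data, encoding):
        """Sketch: when the encoding is already known, a correctly
        encoded BOM just decodes to a leading U+FEFF and can be dropped;
        a BOM in the wrong encoding never looks like a BOM at all."""
        text = data.decode(encoding, errors="replace")
        return text[1:] if text.startswith("\ufeff") else text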
--
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Saturday, 23 June 2007 02:35:51 UTC