- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Sat, 11 Mar 2006 17:10:31 +0200
On Mar 10, 2006, at 22:49, Ian Hickson wrote:

> I'm actually considering just requiring that UAs support rewinding (by
> defining the exact semantics of how to parse for the <meta> header). Is
> this something people would object to?

I think allowing in-place decoder change (when feasible) would be good
for performance.

>> I think it would be beneficial to additionally stipulate that
>> 1. The meta element-based character encoding information declaration
>> is expected to work only if the Basic Latin range of characters maps
>> to the same bytes as in the US-ASCII encoding.
>
> Is this realistic? I'm not really familiar enough with character
> encodings to say if this is what happens in general.

I suppose it is realistic. See below.

>> 2. If there is no external character encoding information nor a BOM
>> (see below), there MUST NOT be any non-ASCII bytes in the document
>> byte stream before the end of the meta element that declares the
>> character encoding. (In practice this would ban unescaped non-ASCII
>> class names on the html and [head] elements and non-ASCII comments at
>> the beginning of the document.)
>
> Again, can we realistically require this? I need to do some studies of
> non-latin pages, I guess.

As UA behavior, no. As a conformance requirement, maybe.

>>> Authors should avoid including inline character encoding
>>> information. Character encoding information should instead be
>>> included at the transport level (e.g. using the HTTP Content-Type
>>> header).
>>
>> I disagree.
>>
>> With HTML with contemporary UAs, there is no real harm in including
>> the character encoding information both on the HTTP level and in the
>> meta as long as the information is not contradictory. On the
>> contrary, the author-provided internal information is actually useful
>> when end users save pages to disk using UAs that do not reserialize
>> with internal character encoding information.
>
> ...and it breaks everything when you have a transcoding proxy, or
> similar.

Well, not until you save to disk, since HTTP takes precedence. However,
authors can escape this by using UTF-8. (Assuming here that tampering
with UTF-8 would be harmful, wrong and pointless.)

Interestingly, transcoding proxies tend to be brought up by residents of
Western Europe, North America or the Commonwealth. I have never seen a
Russian person living in Russia or a Japanese person living in Japan
talk about transcoding proxies in any online or offline discussion.
That's why I doubt the importance of transcoding proxies.

FWIW, I think Opera Mini is a distributed UA--not a proxy and a UA.

> Character encoding information shouldn't be duplicated, IMHO, that's
> just asking for trouble.

I suggest a mismatch be considered an easy parse error and, therefore,
reportable.

>>> For HTML, user agents must use the following algorithm in
>>> determining the character encoding of a document:
>>> 1. If the transport layer specifies an encoding, use that.
>>
>> Shouldn't there be a BOM-sniffing step here? (UTF-16 and UTF-8 only;
>> UTF-32 makes no practical sense for interchange on the Web.)
>
> I don't know, should there?

I believe there should.

>>> 2. Otherwise, if the user agent can find a meta element that
>>> specifies character encoding information (as described above), then
>>> use that.
>>
>> If a conformance checker has not determined the character encoding by
>> now, what should it do? Should it report the document as
>> non-conforming (my preferred choice)? Should it default to US-ASCII
>> and report any non-ASCII bytes as conformance errors? Should it
>> continue to the fuzzier steps like browsers would (hopefully not)?
>
> Again, I don't know.

I'll continue to treat such documents as non-conforming, then.

> Currently the behaviour is very underspecified here:
>
> http://whatwg.org/specs/web-apps/current-work/#documentEncoding
>
> I'd like to rewrite that bit.
> It will require a lot of research; of existing authoring practices,
> of current UAs, and of author needs. If anyone wants to step up and
> do the work, I'd be very happy to work with them and get something
> sorted out here.

Disclaimer: This is not based on reading the source of Gecko or WebKit.
Instead, this is based on quick research in character encodings and on
black box testing of Firefox 1.5, Opera 9.0 preview and Safari 2.0.3.

Tests: http://hsivonen.iki.fi/test/wa10/encoding-detection/
(c- means that I think it should be a conforming case and nc- means that
I think it should be a non-conforming case.)

It turns out that most character encodings have the property that in the
initial state of the decoder the bytes 0x20-0x7E (inclusive) as well as
0x09, 0x0A and 0x0D decode to the Unicode code points of the same
(zero-extended) value. Character encodings that have this property
(hereafter "rough ASCII superset") include:

Big5
Big5-HKSCS
EUC-JP
EUC-KR
GB18030
GB2312
GBK
IBM00858
IBM437
IBM775
IBM850
IBM852
IBM855
IBM857
IBM860
IBM861
IBM862
IBM863
IBM865
IBM866
IBM868
IBM869
ISO-2022-CN
ISO-2022-JP
ISO-2022-KR
ISO-8859-1
ISO-8859-10
ISO-8859-13
ISO-8859-14
ISO-8859-15
ISO-8859-16
ISO-8859-2
ISO-8859-3
ISO-8859-4
ISO-8859-5
ISO-8859-6
ISO-8859-7
ISO-8859-8
ISO-8859-9
JIS_X0201
KOI8-R
KOI8-U
MacRoman
Shift_JIS
TIS-620
US-ASCII
UTF-8
VISCII
windows-1250
windows-1251
windows-1252
windows-1253
windows-1254
windows-1255
windows-1256
windows-1257
windows-1258
windows-31j
x-ARMSCII
x-Big5-Solaris
x-EUC-TW
x-IBM1006
x-IBM1046
x-IBM1098
x-IBM1124
x-IBM1381
x-IBM1383
x-IBM737
x-IBM856
x-IBM874
x-IBM921
x-IBM922
x-IBM942C
x-IBM943C
x-IBM948
x-IBM949C
x-IBM950
x-IBM970
x-ISO-2022-CN-CNS
x-ISO-2022-CN-GB
x-JISAutoDetect
x-Johab
x-MS950-HKSCS
x-MacArabic
x-MacCentralEurope
x-MacCroatian
x-MacCyrillic
x-MacGreek
x-MacHebrew
x-MacIceland
x-MacRomania
x-MacThai
x-MacTurkish
x-MacUkraine
x-PCK
x-euc-jp-linux
x-eucJP-Open
x-iso-8859-11
x-iso-8859-12
x-mswin-936
x-windows-874
x-windows-949
x-windows-950

Notably, character encodings that I am aware of and that do not have
this property are: JIS_X0212-1990, x-JIS0208, various legacy IBM
codepages, x-MacDingbat and x-MacSymbol, UTF-7, UTF-16 and UTF-32.

The x-MacDingbat and x-MacSymbol encodings are irrelevant to Web pages.
After browsing the encoding menus of Firefox, Opera and Safari, I'm
pretty confident that the legacy IBM codepages are irrelevant as well.

I suggest the following algorithm as a starting point. It does not
handle UTF-7, CESU-8, JIS_X0212-1990 or x-JIS0208.

- -

Set the REWIND flag to unraised.

Read the first four bytes of the byte stream.

If the bytes constitute a big-endian UTF-32 BOM, set the character
encoding to big-endian UTF-32 and initialize the corresponding decoder.
The detection algorithm terminates.

If the bytes constitute a little-endian UTF-32 BOM, set the character
encoding to little-endian UTF-32 and initialize the corresponding
decoder. The detection algorithm terminates.

If the first two bytes constitute a big-endian UTF-16 BOM, set the
character encoding to big-endian UTF-16, unread the third and fourth
byte and initialize the corresponding decoder. The detection algorithm
terminates.

If the first two bytes constitute a little-endian UTF-16 BOM, set the
character encoding to little-endian UTF-16, unread the third and fourth
byte and initialize the corresponding decoder. The detection algorithm
terminates.

If the first three bytes constitute a UTF-8 BOM, set the character
encoding to UTF-8, unread the fourth byte and initialize the
corresponding decoder. The detection algorithm terminates.

If the bytes have the pattern 0x00, 0x00, 0x00, 0x00, emit a hard parse
error, unread the bytes and perform implementation-specific heuristics.
Set the character encoding to the output of the heuristics. The
detection algorithm terminates. (Note: need more testing here.)

If the bytes have the pattern 0x00, 0x00, 0x00, NOT-0x00, set the
character encoding to UTF-32BE, emit an easy parse error, unread the
bytes and initialize the corresponding decoder. The detection algorithm
terminates. (Note: need more testing here.)

If the bytes have the pattern NOT-0x00, 0x00, 0x00, 0x00, set the
character encoding to UTF-32LE, emit an easy parse error, unread the
bytes and initialize the corresponding decoder. The detection algorithm
terminates. (Note: need more testing here.)

If the first two bytes have the pattern 0x00, NOT-0x00, set the
character encoding to UTF-16BE, emit an easy parse error, unread the
bytes and initialize the corresponding decoder. The detection algorithm
terminates. (Note: need more testing here.)

If the first two bytes have the pattern NOT-0x00, 0x00, set the
character encoding to UTF-16LE, emit an easy parse error, unread the
bytes and initialize the corresponding decoder. The detection algorithm
terminates. (Note: need more testing here.)

Initialize a character decoder that decodes the bytes 0x20-0x7E
(inclusive) as well as 0x09, 0x0A and 0x0D to the Unicode code points of
the same (zero-extended) value, and that maps all other bytes to U+FFFD,
raising the REWIND flag and emitting an easy parse error when doing so.
If the UA supports in-place decoder switching (see below), the decoder
should not buffer and should only consume one byte of the byte stream
when one character is read from the decoder.

Start the HTML parser but do not execute scripts.

If the script start tag is seen and the UA supports scripting, raise the
REWIND flag and emit an easy parse error.

If a start tag other than html or head is seen, emit an easy parse
error.

If the end of the head element is seen, emit a hard parse error, perform
implementation-specific heuristics, tear down the DOM, rewind the byte
stream and restart the parser. The detection algorithm terminates.

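As an illustration only, the steps up to this point (the BOM checks, the
zero-byte patterns and the restricted provisional decoder) might be
sketched in Python roughly as follows. The names Sniff, sniff_first_four
and provisional_decode are made up for this sketch and do not come from
any UA. The handling of streams shorter than four bytes is an
assumption, since the algorithm text does not cover that case. Parse
errors are only reported as strings, and the provisional decoder returns
the REWIND flag without separately emitting its easy parse error.

```python
from typing import NamedTuple, Optional, Tuple


class Sniff(NamedTuple):
    encoding: Optional[str]  # None: this step did not pick an encoding
    consumed: int            # BOM bytes to skip; 0 means all bytes are unread
    error: Optional[str]     # None, "easy" or "hard" parse error


# BOMs in the order the text checks them. UTF-32LE has to come before
# UTF-16LE because FF FE 00 00 also starts with the UTF-16LE BOM.
BOMS = (
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xef\xbb\xbf", "utf-8"),
)

# Bytes the provisional decoder passes through unchanged.
ASCII_SAFE = frozenset(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}


def sniff_first_four(data: bytes) -> Sniff:
    """Apply the BOM checks, then the BOM-less zero-byte patterns."""
    first = data[:4]
    for bom, name in BOMS:
        if first.startswith(bom):
            return Sniff(name, len(bom), None)
    if len(first) == 4 and first[:3] == b"\x00\x00\x00":
        if first[3] == 0:
            # 00 00 00 00: hard error, encoding is left to heuristics.
            return Sniff(None, 0, "hard")
        return Sniff("utf-32-be", 0, "easy")
    if len(first) == 4 and first[1:] == b"\x00\x00\x00":
        # first[0] is non-zero here; the all-zero case returned above.
        return Sniff("utf-32-le", 0, "easy")
    if len(first) >= 2:
        if first[0] == 0 and first[1] != 0:
            return Sniff("utf-16-be", 0, "easy")
        if first[0] != 0 and first[1] == 0:
            return Sniff("utf-16-le", 0, "easy")
    # No decision: continue with the provisional decoder.
    return Sniff(None, 0, None)


def provisional_decode(data: bytes) -> Tuple[str, bool]:
    """Decode with the restricted decoder; return (text, rewind_flag)."""
    rewind = False
    out = []
    for byte in data:
        if byte in ASCII_SAFE:
            out.append(chr(byte))
        else:
            out.append("\ufffd")
            rewind = True  # any other byte raises the REWIND flag
    return "".join(out), rewind
```

In this sketch a result with no encoding and a "hard" error stands for
"fall through to implementation-specific heuristics", while a result
with no encoding and no error means the meta-based steps should run.
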
If a meta element is seen whose http-equiv attribute has the value
"Content-Type" (compare case-insensitively) and whose content attribute
has a value that begins with "text/html; charset=", the string in the
content attribute following the initial "text/html; charset=" is taken,
white space is removed from both sides, and the result is considered the
tentative encoding name. (Note: Safari allows spaces, line breaks and
tabs around the attribute values. Firefox allows spaces. Opera does not
allow anything extra.)

If the tentative encoding name does not identify a rough ASCII superset
supported by the UA, emit a hard parse error and perform
implementation-specific heuristics. Set the character encoding to the
output of the heuristics. If the REWIND flag has been raised, rewind the
byte stream and tear down the DOM. If the REWIND flag has not been
raised and the heuristics yield a rough ASCII superset, either change
the decoder in place or rewind the byte stream, tear down the DOM and
restart the parser. (Changing in place is recommended.) The detection
algorithm terminates.

If the tentative encoding name identifies a rough ASCII superset
supported by the UA, set the character encoding to the tentative
encoding. If the REWIND flag has been raised, rewind the byte stream and
tear down the DOM. If the REWIND flag has not been raised, either change
the decoder in place or rewind the byte stream, tear down the DOM and
restart the parser. (Changing in place is recommended.) The detection
algorithm terminates.

Where performing implementation-specific heuristics is called for, the
UA may analyze the byte spectrum using statistical methods. However, at
minimum the UA must fall back on a user-chosen encoding that is a rough
ASCII superset. This user choice should default to Windows-1252.

- -

Requirements I'd like to see:

Documents must specify a character encoding and must use an
IANA-registered encoding and must identify it using its preferred MIME
name or use a BOM (with UTF-8, UTF-16 or UTF-32).

UAs must recognize the preferred MIME name of every encoding they
support that has a preferred MIME name. UAs should recognize
IANA-registered aliases.

Documents must not use UTF-EBCDIC, BOCU-1, CESU-8, UTF-7, UTF-16BE (i.e.
BOMless), UTF-16LE, UTF-32BE, UTF-32LE or any encodings from the EBCDIC
family of encodings. Documents using the UTF-16 or UTF-32 encodings must
have a BOM.

UAs must support the UTF-8 encoding.

Encoding errors are easy parse errors. (Emit U+FFFD on bogus data.)

Authors are advised to use the UTF-8 encoding. Authors are advised not
to use the UTF-32 encoding or legacy encodings. (Note: I think UTF-32 on
the Web is harmful and utterly pointless, but Firefox and Opera support
it. Also, I'd like to have some text in the spec that justifies whining
about legacy encodings. On the XML side, I give warnings if the encoding
is not UTF-8, UTF-16, US-ASCII or ISO-8859-1. I also warn about aliases
and potential trouble with RFC 3023 rules. However, I have no spec
backing for treating dangerous RFC 3023 stuff as errors.)

- -

Also, the spec should probably give guidance on what encodings need to
be supported. That set should include at least UTF-8, US-ASCII,
ISO-8859-1 and Windows-1252. It should probably not be larger than the
intersection of the sets of encodings supported by Firefox, Opera,
Safari and IE6. (It might even be useful to intersect that set with the
encodings supported by JDK and Python by default.)

--
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/
Received on Saturday, 11 March 2006 07:10:31 UTC