[whatwg] Internal character encoding declaration from Henri Sivonen on 2006-03-11 (public-whatwg-archive@w3.org from March 2006)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Sat, 11 Mar 2006 17:10:31 +0200
Message-ID: <C065F7B0-BF94-449C-BC78-7CF40316FD17@iki.fi>
On Mar 10, 2006, at 22:49, Ian Hickson wrote:

> I'm actually considering just requiring that UAs support rewinding (by
> defining the exact semantics of how to parse for the <meta> header). Is
> this something people would object to?

I think allowing in-place decoder change (when feasible) would be  
good for performance.

>> I think it would be beneficial to additionally stipulate that
>> 1. The meta element-based character encoding information declaration is
>> expected to work only if the Basic Latin range of characters maps to the
>> same bytes as in the US-ASCII encoding.
>
> Is this realistic? I'm not really familiar enough with character encodings
> to say if this is what happens in general.

I suppose it is realistic. See below.

>> 2. If there is no external character encoding information nor a BOM (see
>> below), there MUST NOT be any non-ASCII bytes in the document byte
>> stream before the end of the meta element that declares the character
>> encoding. (In practice this would ban unescaped non-ASCII class names on
>> the html and [head] elements and non-ASCII comments at the beginning of
>> the document.)
>
> Again, can we realistically require this? I need to do some studies of
> non-latin pages, I guess.

As UA behavior, no. As a conformance requirement, maybe.

>>> Authors should avoid including inline character encoding information.
>>> Character encoding information should instead be included at the
>>> transport level (e.g. using the HTTP Content-Type header).
>>
>> I disagree.
>>
>> With HTML with contemporary UAs, there is no real harm in including the
>> character encoding information both on the HTTP level and in the meta as
>> long as the information is not contradictory. On the contrary, the
>> author-provided internal information is actually useful when end users
>> save pages to disk using UAs that do not reserialize with internal
>> character encoding information.
>
> ...and it breaks everything when you have a transcoding proxy, or similar.

Well, not until you save to disk, since HTTP takes precedence.  
However, authors can escape this by using UTF-8. (Assuming here that  
tampering with UTF-8 would be harmful, wrong and pointless.)

Interestingly, transcoding proxies tend to be brought up by residents  
of Western Europe, North America or the Commonwealth. I have never  
seen a Russian person living in Russia or a Japanese person living in  
Japan talk about transcoding proxies in any online or offline  
discussion. That's why I doubt the importance of transcoding proxies.

FWIW, I think Opera Mini is a distributed UA--not a proxy and a UA.

> Character encoding information shouldn't be duplicated, IMHO, that's just
> asking for trouble.

I suggest a mismatch be considered an easy parse error and,  
therefore, reportable.

>>> For HTML, user agents must use the following algorithm in determining the
>>> character encoding of a document:
>>> 1. If the transport layer specifies an encoding, use that.
>>
>> Shouldn't there be a BOM-sniffing step here? (UTF-16 and UTF-8 only; UTF-32
>> makes no practical sense for interchange on the Web.)
>
> I don't know, should there?

I believe there should.

>>> 2. Otherwise, if the user agent can find a meta element that specifies
>>> character encoding information (as described above), then use that.
>>
>> If a conformance checker has not determined the character encoding by
>> now, what should it do? Should it report the document as non-conforming
>> (my preferred choice)? Should it default to US-ASCII and report any
>> non-ASCII bytes as conformance errors? Should it continue to the fuzzier
>> steps like browsers would (hopefully not)?
>
> Again, I don't know.

I'll continue to treat such documents as non-conforming, then.

> Currently the behaviour is very underspecified here:
>
>    http://whatwg.org/specs/web-apps/current-work/#documentEncoding
>
> I'd like to rewrite that bit. It will require a lot of research; of
> existing authoring practices, of current UAs, and of author needs. If
> anyone wants to step up and do the work, I'd be very happy to work with
> them and get something sorted out here.

Disclaimer: This is not based on reading the source of Gecko or
WebKit. Instead, this is based on quick research into character
encodings and on black box testing of Firefox 1.5, Opera 9.0 preview
and Safari 2.0.3. Tests:
http://hsivonen.iki.fi/test/wa10/encoding-detection/ (c- means that I
think it should be a conforming case and nc- means that I think it
should be a non-conforming case.)

It turns out that most character encodings have the property that in  
the initial state of the decoder the bytes 0x20-0x7E (inclusive) as  
well as 0x09, 0x0A and 0x0D decode to the Unicode code points of the  
same (zero-extended) value. Character encodings that have this  
property (hereafter "rough ASCII superset") include:
Big5
Big5-HKSCS
EUC-JP
EUC-KR
GB18030
GB2312
GBK
IBM00858
IBM437
IBM775
IBM850
IBM852
IBM855
IBM857
IBM860
IBM861
IBM862
IBM863
IBM865
IBM866
IBM868
IBM869
ISO-2022-CN
ISO-2022-JP
ISO-2022-KR
ISO-8859-1
ISO-8859-10
ISO-8859-13
ISO-8859-14
ISO-8859-15
ISO-8859-16
ISO-8859-2
ISO-8859-3
ISO-8859-4
ISO-8859-5
ISO-8859-6
ISO-8859-7
ISO-8859-8
ISO-8859-9
JIS_X0201
KOI8-R
KOI8-U
MacRoman
Shift_JIS
TIS-620
US-ASCII
UTF-8
VISCII
windows-1250
windows-1251
windows-1252
windows-1253
windows-1254
windows-1255
windows-1256
windows-1257
windows-1258
windows-31j
x-ARMSCII
x-Big5-Solaris
x-EUC-TW
x-IBM1006
x-IBM1046
x-IBM1098
x-IBM1124
x-IBM1381
x-IBM1383
x-IBM737
x-IBM856
x-IBM874
x-IBM921
x-IBM922
x-IBM942C
x-IBM943C
x-IBM948
x-IBM949C
x-IBM950
x-IBM970
x-ISO-2022-CN-CNS
x-ISO-2022-CN-GB
x-JISAutoDetect
x-Johab
x-MS950-HKSCS
x-MacArabic
x-MacCentralEurope
x-MacCroatian
x-MacCyrillic
x-MacGreek
x-MacHebrew
x-MacIceland
x-MacRomania
x-MacThai
x-MacTurkish
x-MacUkraine
x-PCK
x-euc-jp-linux
x-eucJP-Open
x-iso-8859-11
x-iso-8859-12
x-mswin-936
x-windows-874
x-windows-949
x-windows-950
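
As a rough illustration, the property can be checked mechanically. The
sketch below uses Python's built-in codecs, which is an assumption: the
list above was not necessarily produced this way, and Python's codec
tables may differ from those of a browser. Decoding each byte on its own
approximates "the initial state of the decoder". The function name is
hypothetical.

```python
# Bytes that must decode to the code point of the same value.
PROBE_BYTES = list(range(0x20, 0x7F)) + [0x09, 0x0A, 0x0D]


def is_rough_ascii_superset(encoding: str) -> bool:
    """Return True if every probe byte, decoded alone, yields chr(byte)."""
    for b in PROBE_BYTES:
        try:
            # A fresh decode of one byte starts from the decoder's initial state.
            if bytes([b]).decode(encoding) != chr(b):
                return False
        except (UnicodeDecodeError, LookupError):
            # Either the byte is not decodable on its own (e.g. UTF-16),
            # or this Python build does not know the encoding name.
            return False
    return True
```

Under this check, an encoding such as UTF-16 fails because a single byte
is a truncated code unit, and an EBCDIC code page fails because 0x20
does not decode to a space.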

Notably, character encodings that I am aware of and do not have this  
property are:
JIS_X0212-1990, x-JIS0208, various legacy IBM codepages, x-MacDingbat  
and x-MacSymbol, UTF-7, UTF-16 and UTF-32.

The x-MacDingbat and x-MacSymbol encodings are irrelevant to Web  
pages. After browsing the encoding menus of Firefox, Opera and  
Safari, I'm pretty confident that the legacy IBM codepages are  
irrelevant as well.

I suggest the following algorithm as a starting point. It does not  
handle UTF-7, CESU-8, JIS_X0212-1990 or x-JIS0208.

- -

Set the REWIND flag to unraised.

Read the first four bytes of the byte stream.

If the bytes constitute a big-endian UTF-32 BOM, set the character  
encoding to big-endian UTF-32 and initialize the corresponding  
decoder. The detection algorithm terminates.

If the bytes constitute a little-endian UTF-32 BOM, set the character  
encoding to little-endian UTF-32 and initialize the corresponding  
decoder. The detection algorithm terminates.

If the first two bytes constitute a big-endian UTF-16 BOM, set the  
character encoding to big-endian UTF-16, unread the third and fourth  
byte and initialize the corresponding decoder. The detection  
algorithm terminates.

If the first two bytes constitute a little-endian UTF-16 BOM, set the  
character encoding to little-endian UTF-16, unread the third and  
fourth byte and initialize the corresponding decoder. The detection  
algorithm terminates.

If the first three bytes constitute a UTF-8 BOM, set the character  
encoding to UTF-8, unread the fourth byte and initialize the  
corresponding decoder. The detection algorithm terminates.
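
The BOM steps above could be sketched roughly as follows. This is only
an illustration in Python; the function name and the return shape (an
encoding label plus the number of bytes to unread) are assumptions made
for the example. The check order matters because the little-endian
UTF-32 BOM starts with the same two bytes as the little-endian UTF-16
BOM.

```python
import codecs
from typing import Optional, Tuple

# Checked in the same order as the steps above: UTF-32 before UTF-16,
# because FF FE 00 00 (UTF-32LE) begins with FF FE (UTF-16LE).
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF8, "UTF-8"),
]


def sniff_bom(first_four: bytes) -> Optional[Tuple[str, int]]:
    """Return (encoding, bytes_to_unread) if a BOM is present, else None."""
    for bom, name in BOMS:
        if first_four.startswith(bom):
            # Bytes read beyond the BOM go back to the stream.
            return name, len(first_four) - len(bom)
    return None
```

For example, `sniff_bom(b"\xef\xbb\xbf<")` reports UTF-8 with one byte
to unread, while `sniff_bom(b"<htm")` returns `None` and detection
continues with the later steps.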

If the bytes have the pattern 0x00, 0x00, 0x00, 0x00, emit a hard  
parse error, unread the bytes and perform implementation-specific  
heuristics. Set the character encoding to the output of the  
heuristics. The detection algorithm terminates. (Note: need more  
testing here.)

If the bytes have the pattern 0x00, 0x00, 0x00, NOT-0x00, set the  
character encoding to UTF-32BE, emit an easy parse error, unread the  
bytes and initialize the corresponding decoder. The detection  
algorithm terminates. (Note: need more testing here.)

If the bytes have the pattern NOT-0x00, 0x00, 0x00, 0x00, set the  
character encoding to UTF-32LE, emit an easy parse error, unread the  
bytes and initialize the corresponding decoder. The detection  
algorithm terminates. (Note: need more testing here.)

If the first two bytes have the pattern 0x00, NOT-0x00, set the  
character encoding to UTF-16BE, emit an easy parse error, unread the  
bytes and initialize the corresponding decoder. The detection  
algorithm terminates. (Note: need more testing here.)

If the first two bytes have the pattern NOT-0x00, 0x00, set the  
character encoding to UTF-16LE, emit an easy parse error, unread the  
bytes and initialize the corresponding decoder. The detection  
algorithm terminates. (Note: need more testing here.)
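
A minimal sketch of the BOM-less zero-byte patterns above, in Python.
The function name and the "easy"/"hard" strings are assumptions made for
the example, and so is the choice to skip this step when fewer than four
bytes are available, which the steps above do not address.

```python
from typing import Optional, Tuple


def sniff_null_pattern(first_four: bytes) -> Optional[Tuple[Optional[str], str]]:
    """Return (encoding, error_severity), or None if no pattern matches.

    An encoding of None means "defer to implementation-specific heuristics".
    """
    if len(first_four) < 4:
        return None  # assumption: short streams skip this step
    zero = [b == 0 for b in first_four]
    if zero == [True, True, True, True]:
        return None, "hard"
    if zero == [True, True, True, False]:
        return "UTF-32BE", "easy"
    if zero == [False, True, True, True]:
        return "UTF-32LE", "easy"
    # Only the first two bytes matter for the UTF-16 patterns.
    if zero[:2] == [True, False]:
        return "UTF-16BE", "easy"
    if zero[:2] == [False, True]:
        return "UTF-16LE", "easy"
    return None
```

In every matching case all four bytes are unread afterwards, since none
of them belongs to a BOM.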

Initialize a character decoder that decodes the bytes 0x20-0x7E
(inclusive) as well as 0x09, 0x0A and 0x0D to the Unicode code points
of the same (zero-extended) value and maps all other bytes to U+FFFD,
raising the REWIND flag and emitting an easy parse error whenever it
does so. If the UA supports in-place decoder switching (see below),
the decoder should not buffer and should only consume one byte of the
byte stream when one character is read from the decoder.
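
The tentative decoder described above is small enough to sketch in
full. This Python version is illustrative only; the function name and
the convention of returning the REWIND signal as a boolean are
assumptions. It handles exactly one byte per call, which matches the
no-buffering requirement for in-place switching.

```python
from typing import Tuple

REPLACEMENT = "\ufffd"


def decode_tentative_byte(b: int) -> Tuple[str, bool]:
    """Decode one byte with the tentative decoder.

    Returns (character, raise_rewind). When raise_rewind is True the caller
    raises the REWIND flag and emits an easy parse error.
    """
    if 0x20 <= b <= 0x7E or b in (0x09, 0x0A, 0x0D):
        return chr(b), False
    return REPLACEMENT, True
```

A caller would read one byte from the stream, pass it in, and hand the
resulting character to the HTML parser before reading the next byte.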

Start the HTML parser but do not execute scripts.

If the script start tag is seen and the UA supports scripting, raise  
the REWIND flag and emit an easy parse error.

If a start tag other than html or head is seen, emit an easy parse  
error.

If the end of the head element is seen, emit a hard parse error,  
perform implementation-specific heuristics, tear down the DOM, rewind  
the byte stream and restart the parser. The detection algorithm  
terminates.

If a meta element is seen whose http-equiv attribute has the value
"Content-Type" (compare case-insensitively) and whose content
attribute has a value that begins with "text/html; charset=", the
remainder of the content attribute value after "text/html; charset="
is taken, white space is trimmed from both ends, and the result is
considered the tentative encoding name.
(Note: Safari allows spaces, line breaks and tabs around the  
attribute values. Firefox allows spaces. Opera does not allow  
anything extra.)
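
A rough Python sketch of the extraction step, assuming the parser has
already produced the two attribute values as strings. The function name
is hypothetical. Two details are assumptions, since the text above does
not settle them: the "text/html; charset=" prefix is compared exactly
(case-sensitively), and an empty name is treated as no declaration.

```python
from typing import Optional

PREFIX = "text/html; charset="


def tentative_encoding_name(http_equiv: str, content: str) -> Optional[str]:
    """Return the tentative encoding name from a meta element, or None."""
    # The http-equiv value is compared case-insensitively.
    if http_equiv.lower() != "content-type":
        return None
    # Assumption: the content prefix must match exactly.
    if not content.startswith(PREFIX):
        return None
    # Trim white space from both ends of what follows the prefix.
    name = content[len(PREFIX):].strip()
    return name or None
```

With these assumptions, `tentative_encoding_name("content-type",
"text/html; charset= Shift_JIS ")` yields `"Shift_JIS"`.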

If the tentative encoding name does not identify a rough ASCII  
superset supported by the UA, emit a hard parse error and perform  
implementation-specific heuristics. Set the character encoding to the  
output of the heuristics. If the REWIND flag has been raised, rewind  
the byte stream and tear down the DOM. If the REWIND flag has not  
been raised and the heuristics yield a rough ASCII superset, either  
change the decoder in place or rewind the byte stream, tear down the  
DOM and restart the parser. (Changing in place is recommended.) The  
detection algorithm terminates.

If the tentative encoding name identifies a rough ASCII superset  
supported by the UA, set the character encoding to the tentative  
encoding. If the REWIND flag has been raised, rewind the byte stream  
and tear down the DOM. If the REWIND flag has not been raised, either  
change the decoder in place or rewind the byte stream, tear down the  
DOM and restart the parser. (Changing in place is recommended.) The  
detection algorithm terminates.
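
The choice between rewinding and switching in place in the two
paragraphs above can be summarized as a small decision function. This
is a sketch in Python with hypothetical names. It assumes that when the
chosen encoding is not a rough ASCII superset the UA rewinds, a case the
text above leaves open.

```python
def next_step(rewind_raised: bool,
              is_rough_ascii_superset: bool,
              can_switch_in_place: bool) -> str:
    """Return "rewind" or "switch-in-place" once the encoding is chosen."""
    if rewind_raised:
        # A byte was already replaced with U+FFFD or a script start tag was
        # seen, so the DOM built so far has to be torn down.
        return "rewind"
    if is_rough_ascii_superset and can_switch_in_place:
        # Only rough-ASCII bytes were consumed, so the characters produced
        # so far should be the same under the new decoder.
        return "switch-in-place"
    # Assumption: every other case falls back to rewinding.
    return "rewind"
```

"rewind" here stands for rewinding the byte stream, tearing down the
DOM and restarting the parser with the chosen decoder.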

Where performing implementation-specific heuristics is called for,  
the UA may analyze the byte spectrum using statistical methods.  
However, at minimum the UA must fall back on a user-chosen encoding
that is a rough ASCII superset. This user choice should default to
Windows-1252.

- -

Requirements I'd like to see:

Documents must specify a character encoding and must use an
IANA-registered encoding and must identify it using its preferred MIME
name or use a BOM (with UTF-8, UTF-16 or UTF-32). UAs must recognize
the preferred MIME name of every encoding they support that has a
preferred MIME name. UAs should recognize IANA-registered aliases.

Documents must not use UTF-EBCDIC, BOCU-1, CESU-8, UTF-7, UTF-16BE  
(i.e. BOMless), UTF-16LE, UTF-32BE, UTF-32LE or any encodings from  
the EBCDIC family of encodings. Documents using the UTF-16 or UTF-32  
encodings must have a BOM.

UAs must support the UTF-8 encoding.

Encoding errors are easy parse errors. (Emit U+FFFD on bogus data.)

Authors are advised to use the UTF-8 encoding. Authors are advised  
not to use the UTF-32 encoding or legacy encodings. (Note: I think  
UTF-32 on the Web is harmful and utterly pointless, but Firefox and  
Opera support it. Also, I'd like to have some text in the spec that  
justifies whining about legacy encodings. On the XML side, I give  
warnings if the encoding is not UTF-8, UTF-16, US-ASCII or  
ISO-8859-1. I also warn about aliases and potential trouble with RFC  
3023 rules. However, I have no spec backing for treating dangerous  
RFC 3023 stuff as errors.)

- -

Also, the spec should probably give guidance on what encodings need  
to be supported. That set should include at least UTF-8, US-ASCII,  
ISO-8859-1 and Windows-1252. It should probably not be larger than  
the intersection of the sets of encodings supported by Firefox,  
Opera, Safari and IE6. (It might even be useful to intersect that set  
with the encodings supported by JDK and Python by default.)

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/
Received on Saturday, 11 March 2006 07:10:31 UTC