Details on internal encoding declarations

> 3.7.5.4. Specifying the document's character encoding
>
> A character encoding declaration is a mechanism by which the  
> character encoding used to store or transmit a document is specified.
>
> The following restrictions apply to character encoding declarations:
>
>     * The character encoding name given must be the name of the  
> character encoding used to serialise the file.

Please add a note here explaining that as a consequence of the  
presence of the UTF-8 BOM, the only permitted value for the internal  
encoding declaration is UTF-8.
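
For concreteness, this is the kind of check I have in mind on the checker side. A minimal Python sketch; it assumes alias resolution to the preferred name has already happened elsewhere:

UTF8_BOM = b"\xef\xbb\xbf"

def declaration_ok_given_bom(document_bytes, declared_label):
    # If the byte stream doesn't start with the UTF-8 BOM, this particular
    # constraint doesn't apply.
    if not document_bytes.startswith(UTF8_BOM):
        return True
    # With a UTF-8 BOM present, only a declaration resolving to UTF-8 is
    # acceptable; "utf-8" is the preferred name.
    return declared_label.strip().lower() == "utf-8"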

>     * The value must be a valid character encoding name, and must be  
> the preferred name for that encoding. [IANACHARSET]

As a practical matter, are conformance checkers expected to distinguish between strings that cannot be encoding names at all due to their form and strings that could be encoding names but are unknown?
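
To illustrate the distinction I mean, here is a Python sketch. It assumes that something like the mime-charset production of RFC 2978 (a restricted ASCII repertoire of at most 40 characters, if I read that RFC correctly) is the right yardstick for "could be an encoding name at all"; the KNOWN set below is a stand-in for the IANA registry, not real data:

import re

MIME_CHARSET = re.compile(r"^[A-Za-z0-9!#$%&'+\-^_`{}~]{1,40}$")
KNOWN_PREFERRED_NAMES = {"utf-8", "us-ascii", "iso-8859-1"}  # stand-in only

def classify_label(label):
    if not MIME_CHARSET.match(label):
        return "malformed"  # cannot be an encoding name due to its form
    if label.lower() not in KNOWN_PREFERRED_NAMES:
        return "unknown"    # well-formed, but not a known encoding
    return "known"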

>     * The encoding name must be serialised without the use of  
> character entity references or character escapes of any kind.

I assume this layer-violating requirement is here in order to keep the  
prescan simple. In that case, the requirement should not apply only to  
the encoding name but to the entire attribute value containing the  
encoding name.

> If the document does not start with a BOM, and if its encoding is  
> not explicitly given by Content-Type metadata, then the character  
> encoding used must be a superset of US-ASCII (specifically,  
> ANSI_X3.4-1968) for bytes in the range 0x09 - 0x0D, 0x20, 0x21,  
> 0x22, 0x26, 0x27, 0x2C - 0x3F, 0x41 - 0x5A, and 0x61 - 0x7A , and,  
> in addition, if that encoding isn't US-ASCII itself, then the  
> encoding must be specified using a meta element with a charset  
> attribute or a meta element in the Encoding declaration state.

Please add that encodings that aren't supersets of US-ASCII must never  
appear in a meta even if the encoding was also unambiguously given by  
Content-Type metadata or the BOM.
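
For what it's worth, here is roughly how the "superset of US-ASCII for those bytes" condition could be tested on the checker side. A Python sketch; using Python's codec machinery as a stand-in for the converter a UA actually uses is an assumption, and the byte ranges are the ones listed in the quoted text:

INTERESTING_BYTES = (list(range(0x09, 0x0E)) + [0x20, 0x21, 0x22, 0x26, 0x27]
                     + list(range(0x2C, 0x40)) + list(range(0x41, 0x5B))
                     + list(range(0x61, 0x7B)))

def is_ascii_superset_for_prescan(encoding):
    # Each listed byte must round-trip as the corresponding ASCII character.
    try:
        return all(chr(b).encode(encoding) == bytes([b]) for b in INTERESTING_BYTES)
    except (LookupError, UnicodeEncodeError):
        return False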

> Authors should not use JIS_X0212-1990, x-JIS0208, and encodings  
> based on EBCDIC.

It would be nice to have a note about a good heuristic for detecting whether an encoding is EBCDIC-based. Right now I'm treating an encoding as EBCDIC-based if it is not an ASCII superset and its name starts with "cp", "ibm" or "x-ibm".
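
In Python terms, the heuristic is roughly this (is_ascii_superset stands for whatever ASCII-superset test the checker already has, for example the sketch above):

def looks_ebcdic_based(label, is_ascii_superset):
    # Heuristic only: not an ASCII superset and an IBM-ish name.
    name = label.strip().lower()
    return not is_ascii_superset(name) and name.startswith(("cp", "ibm", "x-ibm"))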

[...]
> Authors are encouraged to use UTF-8. Conformance checkers may advise  
> against authors using legacy encodings.


It might be good to have a note about the problems that non-UTF-8 encodings cause for form submission and not-quite-IRI processing.

> 8.2.2.4. Changing the encoding while parsing
>
> When the parser requires the user agent to change the encoding, it  
> must run the following steps.
[...]
> 1. If the new encoding is UTF-16, change it to UTF-8.

Please be specific about UTF-16BE, UTF-16LE, UTF-32, UTF-32BE and  
UTF-32LE.

Also, changing the encoding to something that is not a US-ASCII superset should probably not happen (I haven't tested, though) and should trigger an error.
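
Concretely, I'd expect step 1 to end up meaning something like the following; spelling out the UTF-32 variants is my reading of the intent, not spec text:

UTF16_UTF32_FAMILY = {"utf-16", "utf-16be", "utf-16le",
                      "utf-32", "utf-32be", "utf-32le"}

def adjust_new_encoding(label):
    name = label.strip().lower()
    if name in UTF16_UTF32_FAMILY:
        return "utf-8"
    # Per the suggestion above (mine, not spec text), a result that is not a
    # US-ASCII superset should be rejected and reported here rather than used.
    return name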

[...]
> While the invocation of this algorithm is not a parse error, it is  
> still indicative of non-conforming content.


This is a bit annoying from an implementor's point of view. It's a "left as an exercise to the reader" kind of note. That's not good for interop, because the reader might fail the exercise.

Also, as it stands, the conformance of documents that aren't ASCII-only and that don't have external encoding information or a BOM is really fuzzy. Surely, it can't be right that a document that is not encoded in windows-1252 but contains a couple of megabytes of ASCII-only junk (mainly space characters) before the encoding meta is conforming if there's no other fault. Surely, a document that causes a reparse a couple of megabytes down the road is broken if not downright malicious.

I'd still like to keep a notion of conformance that makes all conforming HTML5 documents parseable with a truly streaming parser. (True streaming may involve buffering n bytes up front, with a reasonable spec-set value of n, but it may not involve arbitrary buffering that depends on the input.)

Perhaps the old requirement of having the internal encoding meta (the  
whole tag--not just the key attribute) within the first 512 bytes of  
the document wasn't such a bad idea after all, even though it does  
make it harder to write a serializer that produces only conforming  
output even when the input of the serializer is malicious.

In that case, conformance could be defined so that a document is non-conforming if changing the encoding while parsing actually would change the encoding even when the prescan was run on exactly 512 bytes.
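
To make the proposed rule concrete, here is a Python sketch. The regex-based prescan below is a crude stand-in for the spec's real prescan algorithm and only serves to illustrate the 512-byte comparison:

import re

META_CHARSET = re.compile(rb"<meta[^>]*charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)",
                          re.IGNORECASE)

def crude_prescan(data):
    match = META_CHARSET.search(data)
    return match.group(1).lower() if match else None

def conforming_under_512_byte_rule(document_bytes):
    # The document would be non-conforming if prescanning only the first 512
    # bytes gave a different answer than prescanning the whole stream.
    return crude_prescan(document_bytes[:512]) == crude_prescan(document_bytes)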

> A start tag whose tag name is "meta"
>
>     Insert an HTML element for the token. Immediately pop the  
> current node off the stack of open elements.
>
>     If the element has a charset attribute, and its value is a  
> supported encoding, and the confidence is currently tentative, then  
> change the encoding to the encoding given by the value of the  
> charset attribute.

Even though it's not strictly necessary to say it here, it would be helpful to add a note saying that if the confidence is already certain and the meta-declared encoding, after resolving aliases and magic superset promotions, does not equal the encoding in use or resolves to an encoding that isn't a superset of US-ASCII, this is a parse error.
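
In Python-ish terms, the condition I mean is the following; resolve() stands for alias resolution plus the magic superset promotions, and is_ascii_superset() for whatever ASCII-superset test the checker uses. Both are placeholders rather than anything the spec defines:

def meta_charset_parse_error(confidence_certain, actual_encoding,
                             declared_label, resolve, is_ascii_superset):
    if not confidence_certain:
        return False
    resolved = resolve(declared_label)
    return (resolved != resolve(actual_encoding)
            or not is_ascii_superset(resolved))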

>     Otherwise, if the element has a content attribute, and applying  
> the algorithm for extracting an encoding from a Content-Type to its  
> value returns a supported encoding encoding, and the confidence is  
> currently tentative, then change the encoding to the encoding  
> encoding.

What if the algorithm for extracting the encoding returns a value but it isn't a supported encoding? Shouldn't that count as an error of some kind?
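
To show what I mean, here is a Python sketch. The regex is a crude stand-in for the spec's "algorithm for extracting an encoding from a Content-Type", and codecs.lookup() stands in for "is a supported encoding":

import codecs
import re

CHARSET_PARAM = re.compile(r"charset\s*=\s*[\"']?\s*([^\s;\"']+)", re.IGNORECASE)

def check_meta_content(content_value):
    match = CHARSET_PARAM.search(content_value)
    if match is None:
        return "nothing extracted"
    label = match.group(1)
    try:
        codecs.lookup(label)
    except LookupError:
        return "extracted %r, but it is unsupported -- arguably an error" % label
    return "extracted supported label %r" % label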

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Thursday, 20 March 2008 12:17:26 UTC