- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Thu, 20 Mar 2008 14:16:35 +0200
- To: HTML WG <public-html@w3.org>
> 3.7.5.4. Specifying the document's character encoding
>
> A character encoding declaration is a mechanism by which the
> character encoding used to store or transmit a document is specified.
>
> The following restrictions apply to character encoding declarations:
>
> * The character encoding name given must be the name of the
> character encoding used to serialise the file.

Please add a note here explaining that, as a consequence of the
presence of the UTF-8 BOM, the only permitted value for the internal
encoding declaration is UTF-8.

> * The value must be a valid character encoding name, and must be
> the preferred name for that encoding. [IANACHARSET]

As a practical matter, are conformance checkers expected to
distinguish between encoding names that cannot be encoding names due
to their form and encoding names that could be encoding names but are
unknown?

> * The encoding name must be serialised without the use of
> character entity references or character escapes of any kind.

I assume this layer-violating requirement is here in order to keep
the prescan simple. In that case, the requirement should not apply
only to the encoding name but to the entire attribute value
containing the encoding name.

> If the document does not start with a BOM, and if its encoding is
> not explicitly given by Content-Type metadata, then the character
> encoding used must be a superset of US-ASCII (specifically,
> ANSI_X3.4-1968) for bytes in the range 0x09 - 0x0D, 0x20, 0x21,
> 0x22, 0x26, 0x27, 0x2C - 0x3F, 0x41 - 0x5A, and 0x61 - 0x7A, and,
> in addition, if that encoding isn't US-ASCII itself, then the
> encoding must be specified using a meta element with a charset
> attribute or a meta element in the Encoding declaration state.

Please add that encodings that aren't supersets of US-ASCII must
never appear in a meta even if the encoding was also unambiguously
given by Content-Type metadata or the BOM.

> Authors should not use JIS_X0212-1990, x-JIS0208, and encodings
> based on EBCDIC.

It would be nice to have a note about a good heuristic for detecting
whether an encoding is EBCDIC-based. Right now I'm using this: the
encoding is not an ASCII superset and its name starts with cp, ibm or
x-ibm. (A rough sketch of that check is further down in this
message.)

[...]

> Authors are encouraged to use UTF-8. Conformance checkers may advise
> against authors using legacy encodings.

It might be good to have a note about the badness that non-UTF-8
encodings cause with form submission and not-quite-IRI processing.

> 8.2.2.4. Changing the encoding while parsing
>
> When the parser requires the user agent to change the encoding, it
> must run the following steps. [...]
> 1. If the new encoding is UTF-16, change it to UTF-8.

Please be specific about UTF-16BE, UTF-16LE, UTF-32, UTF-32BE and
UTF-32LE. Also, changing the encoding to something that is not a
US-ASCII superset should probably not happen (I haven't tested,
though) and should trigger an error.

[...]

> While the invocation of this algorithm is not a parse error, it is
> still indicative of non-conforming content.

This is a bit annoying from an implementor's point of view. It's a
"left as an exercise to the reader" kind of note. That's not good for
interop, because the reader might fail the exercise.

Also, as it stands, the conformance of documents that aren't
ASCII-only and that don't have external encoding information or a BOM
is really fuzzy. Surely, it can't be right that a document that is
not encoded in windows-1252 but contains a couple of megabytes of
ASCII-only junk (mainly space characters) before the encoding meta is
conforming if there's no other fault.
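(Coming back to the EBCDIC heuristic for a moment: below is, very
roughly, the kind of check I had in mind, written against
java.nio.charset only because that keeps the sketch self-contained.
The class and method names are made up for this message, nothing here
is from the spec, and a real checker would presumably go through its
own encoding layer and alias resolution instead. The isAsciiSuperset
part tests exactly the byte ranges quoted above.)

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.util.Locale;

final class EncodingHeuristics {

    // The byte ranges the spec singles out: 0x09-0x0D, 0x20, 0x21,
    // 0x22, 0x26, 0x27, 0x2C-0x3F, 0x41-0x5A and 0x61-0x7A.
    private static final int[][] ASCII_RANGES = {
        { 0x09, 0x0D }, { 0x20, 0x22 }, { 0x26, 0x27 },
        { 0x2C, 0x3F }, { 0x41, 0x5A }, { 0x61, 0x7A }
    };

    // Rough heuristic: treat an encoding as EBCDIC-based if it is not
    // an ASCII superset for the bytes above and its canonical name
    // starts with "cp", "ibm" or "x-ibm".
    static boolean looksLikeEbcdic(Charset charset) {
        if (isAsciiSuperset(charset)) {
            return false;
        }
        String name = charset.name().toLowerCase(Locale.ENGLISH);
        return name.startsWith("cp")
                || name.startsWith("ibm")
                || name.startsWith("x-ibm");
    }

    // Checks that each byte in the ranges above, taken on its own,
    // decodes to the same character it denotes in US-ASCII.
    static boolean isAsciiSuperset(Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        for (int[] range : ASCII_RANGES) {
            for (int b = range[0]; b <= range[1]; b++) {
                ByteBuffer in = ByteBuffer.wrap(new byte[] { (byte) b });
                CharBuffer out;
                try {
                    out = decoder.decode(in);
                } catch (CharacterCodingException e) {
                    return false;
                }
                if (out.remaining() != 1 || out.get(0) != (char) b) {
                    return false;
                }
            }
        }
        return true;
    }
}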
Surely, a document that causes a reparse a couple of megabytes down
the road is broken if not downright malicious. I'd still like to keep
a notion of conformance that makes all conforming HTML5 documents
parseable with a truly streaming parser. (True streaming may involve
buffering n bytes up front, with a reasonable spec-set value of n,
but may not involve arbitrary buffering depending on the input.)

Perhaps the old requirement of having the internal encoding meta (the
whole tag, not just the key attribute) within the first 512 bytes of
the document wasn't such a bad idea after all, even though it does
make it harder to write a serializer that produces only conforming
output even when the input of the serializer is malicious. In that
case, conformance could be defined so that a document is
non-conforming if "changing the encoding while parsing" would
actually change the encoding even when the prescan was run on exactly
512 bytes.

> A start tag whose tag name is "meta"
>
> Insert an HTML element for the token. Immediately pop the
> current node off the stack of open elements.
>
> If the element has a charset attribute, and its value is a
> supported encoding, and the confidence is currently tentative, then
> change the encoding to the encoding given by the value of the
> charset attribute.

Even though it's not strictly necessary to say it here, it would be
helpful to put a note here saying that it is a parse error if the
confidence is already certain and the meta-declared encoding, after
resolving aliases and magic superset promotions, either does not
equal the encoding in use or isn't a superset of US-ASCII.

> Otherwise, if the element has a content attribute, and applying
> the algorithm for extracting an encoding from a Content-Type to its
> value returns a supported encoding encoding, and the confidence is
> currently tentative, then change the encoding to the encoding
> encoding.

What if the algorithm for extracting the encoding returns a value but
it isn't a supported encoding? Shouldn't that count as an error of
some kind? (See the P.S. below for what I mean in practice.)

--
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
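P.S. To make that last question concrete, this is roughly the
distinction I'd like a conformance checker to be able to make when it
looks at the content attribute. The regexp is only a hand-wavy
stand-in for the spec's "algorithm for extracting an encoding from a
Content-Type", and isSupportedEncoding() and err() are made-up hooks,
not anything that exists in the spec or in any particular
implementation.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

abstract class MetaContentCheck {

    // Hand-wavy stand-in for the spec's "algorithm for extracting an
    // encoding from a Content-Type": returns the value of the first
    // charset parameter, or null if there is none.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s;\"']+))",
            Pattern.CASE_INSENSITIVE);

    static String extractCharset(String content) {
        Matcher m = CHARSET.matcher(content);
        if (!m.find()) {
            return null;
        }
        for (int i = 1; i <= 3; i++) {
            if (m.group(i) != null) {
                return m.group(i);
            }
        }
        return null;
    }

    void checkMetaContent(String content) {
        String label = extractCharset(content);
        if (label == null) {
            return; // no charset parameter; nothing encoding-related here
        }
        if (!isSupportedEncoding(label)) {
            // The extraction returned something, but it isn't a
            // supported encoding. This is the case the spec is
            // currently silent about; I'd like it to be reportable.
            err("Unsupported character encoding: " + label);
            return;
        }
        // Otherwise the declared encoding is supported and gets
        // compared against the encoding actually in use elsewhere.
    }

    abstract boolean isSupportedEncoding(String label); // made-up hook

    abstract void err(String message); // made-up hook
}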