- From: Ian Hickson <ian@hixie.ch>
- Date: Thu, 22 May 2008 22:29:03 +0000 (UTC)
- To: Henri Sivonen <hsivonen@iki.fi>
- Cc: HTML WG <public-html@w3.org>
On Thu, 20 Mar 2008, Henri Sivonen wrote:
> > 3.7.5.4. Specifying the document's character encoding
> >
> > A character encoding declaration is a mechanism by which the
> > character encoding used to store or transmit a document is
> > specified.
> >
> > The following restrictions apply to character encoding declarations:
> >
> >  * The character encoding name given must be the name of the
> >    character encoding used to serialise the file.
>
> Please add a note here explaining that as a consequence of the
> presence of the UTF-8 BOM, the only permitted value for the internal
> encoding declaration is UTF-8.

I don't understand the request.

> >  * The value must be a valid character encoding name, and must be
> >    the preferred name for that encoding. [IANACHARSET]
>
> As a practical matter, are conformance checkers expected to
> distinguish between encoding names that cannot be encoding names due
> to form and encoding names that could be encoding names but are
> unknown?

There's no requirement to that effect. I don't see that it would make
any particularly useful difference UI-wise.

> >  * The encoding name must be serialised without the use of
> >    character entity references or character escapes of any kind.
>
> I assume this layer-violating requirement is here in order to keep
> the prescan simple. In that case, the requirement should not apply
> only to the encoding name but to the entire attribute value
> containing the encoding name.

Done.

> > If the document does not start with a BOM, and if its encoding is
> > not explicitly given by Content-Type metadata, then the character
> > encoding used must be a superset of US-ASCII (specifically,
> > ANSI_X3.4-1968) for bytes in the range 0x09 - 0x0D, 0x20, 0x21,
> > 0x22, 0x26, 0x27, 0x2C - 0x3F, 0x41 - 0x5A, and 0x61 - 0x7A, and,
> > in addition, if that encoding isn't US-ASCII itself, then the
> > encoding must be specified using a meta element with a charset
> > attribute or a meta element in the Encoding declaration state.
>
> Please add that encodings that aren't supersets of US-ASCII must
> never appear in a meta even if the encoding was also unambiguously
> given by Content-Type metadata or the BOM.

Why?

> > Authors should not use JIS_X0212-1990, x-JIS0208, and encodings
> > based on EBCDIC.
>
> It would be nice to have a note about a good heuristic for detecting
> whether an encoding is EBCDIC-based. Right now I'm using: not an
> ASCII superset, and the name starts with cp, ibm or x-ibm.

I agree that it would be nice; I've no idea what a good heuristic would
be. I think a comprehensive list might be the only real solution.
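For what it's worth, the heuristic you describe amounts to something
like this (a sketch only, in Python; the ASCII-superset probe is a
crude stand-in for whatever table of known encodings a real checker
would use):

  def is_ascii_superset(name):
      # Crude probe: treat an encoding as a US-ASCII superset if the
      # printable ASCII bytes decode to themselves under it. Unknown
      # names conservatively fail the probe.
      try:
          probe = bytes(range(0x20, 0x7F))
          return probe.decode(name) == probe.decode("ascii")
      except (LookupError, UnicodeDecodeError):
          return False

  def looks_ebcdic_based(name):
      # Not an ASCII superset, and the name starts with cp, ibm or
      # x-ibm.
      n = name.lower()
      return (not is_ascii_superset(n)
              and n.startswith(("cp", "ibm", "x-ibm")))

  # looks_ebcdic_based("cp500") -> True (EBCDIC)
  # looks_ebcdic_based("cp1252") -> False (ASCII superset)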
> [...]
>
> > Authors are encouraged to use UTF-8. Conformance checkers may
> > advise against authors using legacy encodings.
>
> It might be good to have a note about the badness with form
> submission and not-quite-IRI processing that non-UTF-8 encodings
> cause.

I'm not really sure what such a note would consist of. Could you send a
separate e-mail on this topic?

> > 8.2.2.4. Changing the encoding while parsing
> >
> > When the parser requires the user agent to change the encoding, it
> > must run the following steps.
> [...]
> > 1. If the new encoding is UTF-16, change it to UTF-8.
>
> Please be specific about UTF-16BE, UTF-16LE, UTF-32, UTF-32BE and
> UTF-32LE.

What about them?

> Also, changing the encoding to something that is not a US-ASCII
> superset should probably not happen (I haven't tested, though) and
> should trigger an error.

I don't really understand what you mean. If a document is tentatively
assumed to be some variant A of EBCDIC, but the <meta> later turns out
to say that it is some variant B, why shouldn't we switch?

> > While the invocation of this algorithm is not a parse error, it is
> > still indicative of non-conforming content.
>
> This is a bit annoying from an implementor point of view. It's a
> "left as an exercise to the reader" kind of note. That's not good for
> interop, because the reader might fail the exercise.

Would you rather the note wasn't there? I don't really know what to
point to from that note.

> Also, as it stands, the conformance of documents that aren't
> ASCII-only and that don't have external encoding information or a BOM
> is really fuzzy. Surely, it can't be right that a document that is
> not encoded in windows-1252 but contains a couple of megabytes of
> ASCII-only junk (mainly space characters) before the encoding meta is
> conforming if there's no other fault. Surely, a document that causes
> a reparse a couple of megabytes down the road is broken if not
> downright malicious.

Why? It is technically possible (and indeed wise, as such pages are
common) to implement a mechanism that can switch encodings on the fly.
The cost need not be high. Why would the encoding declaration coming
after a multimegabyte comment be a problem?

> I'd still like to keep a notion of conformance that makes all
> conforming HTML5 documents parseable with a truly streaming parser.
> (True streaming may involve buffering n bytes up front with a
> reasonable spec-set value of n, but may not involve arbitrary
> buffering depending on the input.)
>
> Perhaps the old requirement of having the internal encoding meta (the
> whole tag--not just the key attribute) within the first 512 bytes of
> the document wasn't such a bad idea after all, even though it does
> make it harder to write a serializer that produces only conforming
> output even when the input of the serializer is malicious.
>
> In that case, conformance could be defined so that a document is
> non-conforming if changing the encoding while parsing actually would
> change the encoding even when the prescan was run on exactly 512
> bytes.

That seems like a really weird conformance requirement.
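(To make the proposal above concrete: the suggested check is roughly
the following, a much-simplified sketch in which a regex stands in for
the spec's byte-by-byte prescan of section 8.2.2.1.)

  import re

  META_CHARSET = re.compile(
      rb"""<meta[^>]+charset\s*=\s*["']?\s*([a-zA-Z0-9._-]+)""",
      re.IGNORECASE)

  def prescan_512(document_bytes):
      # Run the (simplified) prescan over exactly the first 512 bytes;
      # under the proposal, a conforming document would never need an
      # encoding change that this bounded pass could not have found.
      head = document_bytes[:512]
      m = META_CHARSET.search(head)
      return m.group(1).decode("ascii") if m else None

  # prescan_512(b'<html><meta charset=koi8-r>...') -> 'koi8-r'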
> > A start tag whose tag name is "meta"
> >
> > Insert an HTML element for the token. Immediately pop the current
> > node off the stack of open elements.
> >
> > If the element has a charset attribute, and its value is a
> > supported encoding, and the confidence is currently tentative, then
> > change the encoding to the encoding given by the value of the
> > charset attribute.
>
> Even though it's not strictly necessary to say it here, it would be
> helpful to put a note here saying that if the confidence is already
> certain and the meta-declared encoding after resolving aliases and
> magic superset promotions does not equal the encoding in use or
> returns an encoding that isn't a superset of US-ASCII, this is a
> parse error.

It's not a parse error; it's an error in the encoding declaration as
given in that section. (The spec is long enough already that I don't
want to start making it longer by saying things twice.)

> > Otherwise, if the element has a content attribute, and applying
> > the algorithm for extracting an encoding from a Content-Type to its
> > value returns a supported encoding encoding, and the confidence is
> > currently tentative, then change the encoding to the encoding
> > encoding.
>
> What if the algorithm for extracting the encoding returns a value but
> it isn't a supported encoding? Shouldn't that count as an error of
> some kind?

Document conformance can't be dependent on the UA it is parsed by. UA
conformance can't be dependent on the documents it parses. I don't see
what kind of error it could be.
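(A simplified sketch of the distinction at issue, with the spec's
byte-oriented extraction algorithm reduced to string handling: the
extraction step can return a label that then fails the "supported
encoding" check, and the two cases are observably different even
though neither is defined as an error.)

  import codecs

  def extract_encoding(content_type):
      # Simplified version of "extracting an encoding from a
      # Content-Type"; the real algorithm is fussier about quoting.
      for param in content_type.split(";")[1:]:
          name, _, value = param.partition("=")
          if name.strip().lower() == "charset":
              return value.strip().strip("\"'") or None
      return None

  def encoding_if_supported(content_type):
      label = extract_encoding(content_type)
      if label is None:
          return None                       # nothing extracted at all
      try:
          return codecs.lookup(label).name  # extracted and supported
      except LookupError:
          return None                       # extracted, unsupported

  # encoding_if_supported("text/html; charset=utf-8")    -> 'utf-8'
  # encoding_if_supported("text/html; charset=bogus-9x") -> None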
-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Thursday, 22 May 2008 22:29:47 UTC