From: Ian Hickson <ian@hixie.ch>
Date: Sat, 24 May 2008 10:45:38 +0000 (UTC)
To: Henri Sivonen <hsivonen@iki.fi>
Cc: HTML WG <public-html@w3.org>
On Fri, 23 May 2008, Henri Sivonen wrote:
> On May 23, 2008, at 01:29, Ian Hickson wrote:
> > On Thu, 20 Mar 2008, Henri Sivonen wrote:
> > > > 3.7.5.4. Specifying the document's character encoding
> > > >
> > > > A character encoding declaration is a mechanism by which the
> > > > character encoding used to store or transmit a document is
> > > > specified.
> > > >
> > > > The following restrictions apply to character encoding
> > > > declarations:
> > > >
> > > > * The character encoding name given must be the name of the
> > > >   character encoding used to serialise the file.
> > >
> > > Please add a note here explaining that as a consequence of the
> > > presence of the UTF-8 BOM, the only permitted value for the
> > > internal encoding declaration is UTF-8.
> >
> > I don't understand the request.
>
> I meant adding a note like this: "Note: If the first three bytes in
> the file form a UTF-8 BOM, the only permitted encoding name is
> 'UTF-8'."

Is that really helpful? I don't really understand what the point of the
note would be.

> > > > If the document does not start with a BOM, and if its encoding
> > > > is not explicitly given by Content-Type metadata, then the
> > > > character encoding used must be a superset of US-ASCII
> > > > (specifically, ANSI_X3.4-1968) for bytes in the range
> > > > 0x09 - 0x0D, 0x20, 0x21, 0x22, 0x26, 0x27, 0x2C - 0x3F,
> > > > 0x41 - 0x5A, and 0x61 - 0x7A, and, in addition, if that
> > > > encoding isn't US-ASCII itself, then the encoding must be
> > > > specified using a meta element with a charset attribute or a
> > > > meta element in the Encoding declaration state.
> > >
> > > Please add that encodings that aren't supersets of US-ASCII must
> > > never appear in a meta even if the encoding was also unambiguously
> > > given by Content-Type metadata or the BOM.
> >
> > Why?
>
> Because the meta will never work. It will never be useful, but being
> led to believe that it were may cause trouble.

Done.
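Henri's proposed note amounts to a simple mechanical check. A minimal sketch of what a conformance checker could do, in Python; the function name and message wording are illustrative, not taken from the spec:

```python
# Sketch of the check behind Henri's proposed note: if a file starts with
# the UTF-8 BOM (EF BB BF), the only permitted internal encoding name is
# "UTF-8". Function name and error wording are illustrative.

UTF8_BOM = b"\xef\xbb\xbf"

def check_internal_declaration(document_bytes, declared_name):
    """Return a list of conformance errors for the declared encoding name."""
    errors = []
    if document_bytes.startswith(UTF8_BOM) and declared_name.lower() != "utf-8":
        errors.append(
            "File begins with a UTF-8 BOM, so the only permitted "
            "encoding name is 'UTF-8' (got %r)." % declared_name)
    return errors
```

A BOM-prefixed file declaring windows-1251 would produce one error; the same file declaring "UTF-8" (in any case) would produce none.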
> > > > Authors should not use JIS_X0212-1990, x-JIS0208, and encodings
> > > > based on EBCDIC.
> > >
> > > It would be nice to have a note about a good heuristic for
> > > detecting whether an encoding is EBCDIC-based. Right now I'm
> > > using: not an ASCII superset, and the name starts with "cp",
> > > "ibm", or "x-ibm".
> >
> > I agree that it would be nice; I've no idea what a good heuristic
> > would be. I think a comprehensive list might be the only real
> > solution.
>
> I'd expect compatibility considerations in the design of EBCDIC
> variants to have led to a heuristic being feasible, but I don't know
> what the right heuristic is.

Not much I can do then. :-)

> > > > 8.2.2.4. Changing the encoding while parsing
> > > >
> > > > When the parser requires the user agent to change the encoding,
> > > > it must run the following steps.
> > > [...]
> > > > 1. If the new encoding is UTF-16, change it to UTF-8.
> > >
> > > Please be specific about UTF-16BE, UTF-16LE, UTF-32, UTF-32BE and
> > > UTF-32LE.
> >
> > What about them?
>
> Please state explicitly for each one if they MUST or MUST NOT change
> to UTF-8.

I've explicitly shunned UTF-32 now, and will thus not mention it. I've
changed the text to say "a UTF-16 encoding", which is the language I've
used elsewhere to basically cover anything based on UTF-16.

> > > Also, changing the encoding to something that is not a US-ASCII
> > > superset should probably not happen (I haven't tested, though) and
> > > should trigger an error.
> >
> > I don't really understand what you mean. If a document is
> > tentatively assumed to be some variant A of EBCDIC, but the <meta>
> > later turns out to say that it is some variant B, why shouldn't we
> > switch?
>
> How would an EBCDIC variant ever be a tentative encoding, considering
> that it doesn't make sense for Web-oriented heuristic sniffers to ever
> return an EBCDIC variant as the guess?

If that was the encoding the user last used for that file.
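For reference, the heuristic Henri describes earlier in this message (not an ASCII superset, and the name starts with "cp", "ibm", or "x-ibm") is easy to state in code. A sketch, assuming the caller already knows from some encoding registry whether the encoding is an ASCII superset:

```python
# Sketch of Henri's EBCDIC-detection heuristic: guess that an encoding is
# EBCDIC-based if it is not an ASCII superset and its name starts with
# "cp", "ibm", or "x-ibm". The is_ascii_superset flag is assumed to come
# from whatever encoding registry the implementation uses.

def looks_ebcdic(name, is_ascii_superset):
    return (not is_ascii_superset
            and name.lower().startswith(("cp", "ibm", "x-ibm")))
```

Note that this correctly leaves UTF-16 alone (not an ASCII superset, but the name doesn't match) and never fires for windows-125x encodings, which are ASCII supersets.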
> It seems to me that *reasonable* tentative encodings are ASCII
> supersets, given Web reality.

Past user behaviour would trump that.

> > > > While the invocation of this algorithm is not a parse error, it
> > > > is still indicative of non-conforming content.
> > >
> > > This is a bit annoying from an implementor's point of view. It's a
> > > "left as an exercise to the reader" kind of note. That's not good
> > > for interop, because the reader might fail the exercise.
> >
> > Would you rather the note wasn't there? I don't really know what to
> > point to from that note.
>
> If the situation doesn't always indicate that the content is
> non-conforming, yes, I'd prefer the note not to be there. If it always
> indicates that the content is non-conforming, I'd like the note to
> explain why.
>
> Otherwise readers like me end up spending non-trivial time trying to
> understand what the note is implying.

I've removed the note.

> > > Also, as it stands, the conformance of documents that aren't
> > > ASCII-only and that don't have external encoding information or a
> > > BOM is really fuzzy. Surely, it can't be right that a document
> > > that is not encoded in windows-1252 but contains a couple of
> > > megabytes of ASCII-only junk (mainly space characters) before the
> > > encoding meta is conforming if there's no other fault. Surely, a
> > > document that causes a reparse a couple of megabytes down the road
> > > is broken if not downright malicious.
> >
> > Why?
>
> Because it kills performance, but the author could always avoid
> killing performance.
>
> > It is technically possible (and indeed, wise, as such pages are
> > common) to implement a mechanism that can switch encodings on the
> > fly. The cost need not be high.
>
> I implemented on-the-fly decoder switching, but I removed the code
> because it virtually never ran, while keeping track of whether it
> still could be run for a given stream incurred a performance penalty
> all the time.
> The reason why on-the-fly decoder switching doesn't work is that an
> efficient implementation converts a buffer of a couple of thousand
> bytes/characters in one go, so non-ASCII characters in the body of
> the document will have been misconverted by the time a meta is seen.
>
> > Why would the encoding declaration coming after a multimegabyte
> > comment be a problem?
>
> Because tearing down the tree and reparsing multiple megabytes takes
> excessive time compared to the sane case.
>
> > > I'd still like to keep a notion of conformance that makes all
> > > conforming HTML5 documents parseable with a truly streaming
> > > parser. (True streaming may involve buffering n bytes up front,
> > > with a reasonable spec-set value of n, but may not involve
> > > arbitrary buffering depending on the input.)
> > >
> > > Perhaps the old requirement of having the internal encoding meta
> > > (the whole tag--not just the key attribute) within the first 512
> > > bytes of the document wasn't such a bad idea after all, even
> > > though it does make it harder to write a serializer that produces
> > > only conforming output even when the input of the serializer is
> > > malicious.
> > >
> > > In that case, conformance could be defined so that a document is
> > > non-conforming if changing the encoding while parsing actually
> > > would change the encoding even when the prescan was run on exactly
> > > 512 bytes.
> >
> > That seems like a really weird conformance requirement.
>
> It's a requirement to use to having the spec. :-)

So I entirely sympathise with what you're saying above, but I can't
really bring myself to making this non-conforming just because it is
not optimal. I mean, a multimegabyte comment is not going to be
conducive to performance whatever we do, and we're not going to make
that illegal either. I think the requirement that it be the first
element in the <head> is enough. Feel free to include a warning,
though.
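Henri's 512-byte idea is straightforward to picture: run the meta prescan over a fixed-size prefix only, and flag documents where the full parse would later disagree with it. A rough sketch; the regex is a gross simplification of the real HTML5 prescan algorithm (it only handles the charset="..." form) and is for illustration only:

```python
import re

# Rough sketch of a prescan limited to the first 512 bytes, as in Henri's
# proposed conformance rule. The regex is a simplification of the real
# HTML5 prescan; it only recognises the charset="..." attribute form.

META_CHARSET = re.compile(
    rb"""<meta[^>]*charset\s*=\s*["']?([A-Za-z0-9_-]+)""", re.IGNORECASE)

def prescan_512(document_bytes):
    """Return the charset named by a meta in the first 512 bytes, or None."""
    match = META_CHARSET.search(document_bytes[:512])
    return match.group(1).decode("ascii").lower() if match else None
```

Under the proposed rule, a document whose meta falls past byte 512 would be non-conforming whenever changing the encoding while parsing would actually change the encoding.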
> > > > Otherwise, if the element has a content attribute, and applying
> > > > the algorithm for extracting an encoding from a Content-Type to
> > > > its value returns a supported encoding encoding, and the
> > > > confidence is currently tentative, then change the encoding to
> > > > the encoding encoding.
> > >
> > > What if the algorithm for extracting the encoding returns a value
> > > but it isn't a supported encoding? Shouldn't that count as an
> > > error of some kind?
> >
> > Document conformance can't be dependent on the UA it is parsed by.
> > UA conformance can't be dependent on the documents it parses.
> >
> > I don't see what kind of error it could be.
>
> This is a huge theoretical hole in any spec that allows an open-ended
> selection of character encodings. I suppose this could be
> characterized as an error that is in the same class as the network
> connection dropping while the document is being read.
>
> (A simpler way to deal with this would be making the set of permitted
> encodings closed. After all, anything in the set in addition to UTF-8
> is just legacy extra.)

I recommend just saying you don't support the encoding when you run
into this situation, and not otherwise worrying about it.

On Fri, 23 May 2008, Henri Sivonen wrote:
> >
> > If the page is detected as UTF-16, the odds of it being anything
> > else are extremely low,
>
> I found the situation with a real off-the-shelf detector to be
> different.

That's unfortunate.

> Perhaps I'm being irrationally emotional here, but I think it's more
> forgivable to make authoring mistakes with 8-bit legacy encodings
> than with UTF-16, so I have far less sympathy for making bogus UTF-16
> work. I don't have real stats, but I'd expect undeclared Cyrillic
> 8-bit content to be much more common than BOMless UTF-16 content.

> > Well, the spec as it stands allows you to limit it to
> > ASCII-superset encodings if you want. However, I've heard from at
> > least one vendor that they needed to detect UTF-16 (by looking for
> > 00 3C 00 ?? and 3C 00 ?? 00 as the first four bytes; ?? != 00) to
> > support some pages. I can't really see that heuristic being
> > triggered by Windows-1251 pages.
>
> That's sad, but if the Web requires it, perhaps the spec should
> mandate that exact heuristic.

I wasn't entirely convinced by the need myself (I'm more in your camp
of not seeing much UTF-16 at all), but my point is that some people
want to support non-ASCII-compatible encodings there, and I don't see
a good reason to prevent them from doing that. Again, that doesn't
stop you from implementing such a restriction on your end. It's the
official open-ended heuristic hatch for encoding detection. Do what
you want. :-)

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
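The four-byte vendor sniff Ian quotes can be sketched like so; the function name is mine, but the byte patterns are exactly as given (00 3C 00 ?? for big-endian, 3C 00 ?? 00 for little-endian, with ?? != 00):

```python
# Sketch of the four-byte BOMless UTF-16 sniff described in the message:
# a page starting with "<" encoded as UTF-16 shows up as 00 3C 00 xx
# (UTF-16BE) or 3C 00 xx 00 (UTF-16LE), with xx != 00 in both cases.
# The function name is mine, not from the message.

def sniff_bomless_utf16(first_four_bytes):
    b = first_four_bytes
    if len(b) < 4:
        return None
    if b[0] == 0x00 and b[1] == 0x3C and b[2] == 0x00 and b[3] != 0x00:
        return "utf-16be"
    if b[0] == 0x3C and b[1] == 0x00 and b[2] != 0x00 and b[3] == 0x00:
        return "utf-16le"
    return None
```

As Ian notes, an ASCII-superset page such as Windows-1251 starts with a byte like 3C followed by another non-zero byte, so neither pattern can fire on it.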
Received on Saturday, 24 May 2008 10:46:47 UTC