Encoding feedback

On Fri, 23 May 2008, Henri Sivonen wrote:
> On May 23, 2008, at 01:29, Ian Hickson wrote:
> > On Thu, 20 Mar 2008, Henri Sivonen wrote:
> > > > 3.7.5.4. Specifying the document's character encoding
> > > > 
> > > > A character encoding declaration is a mechanism by which the character
> > > > encoding used to store or transmit a document is specified.
> > > > 
> > > > The following restrictions apply to character encoding declarations:
> > > > 
> > > >    * The character encoding name given must be the name of the character
> > > >      encoding used to serialise the file.
> > > 
> > > Please add a note here explaining that as a consequence of the 
> > > presence of the UTF-8 BOM, the only permitted value for the internal 
> > > encoding declaration is UTF-8.
> > 
> > I don't understand the request.
> 
> I meant adding a note like this: "Note: If the first three bytes in the 
> file form a UTF-8 BOM, the only permitted encoding name is 'UTF-8'."

Is that really helpful? I don't really understand what the point of the 
note would be.
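
For concreteness, the check implied by the proposed note could be sketched 
as below. This is only an illustration in Python, not spec text; the 
function name is hypothetical, and it assumes the declared name has 
already been extracted from the meta element.

```python
import codecs


def declaration_allowed_with_bom(data: bytes, declared_name: str) -> bool:
    """Return False only if a UTF-8 BOM is present and the name is not UTF-8."""
    # codecs.BOM_UTF8 is the three bytes EF BB BF.
    if data.startswith(codecs.BOM_UTF8):
        return declared_name.strip().lower() == "utf-8"
    # Without a UTF-8 BOM this particular rule places no restriction.
    return True
```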


> > > > If the document does not start with a BOM, and if its encoding is 
> > > > not explicitly given by Content-Type metadata, then the character 
> > > > encoding used must be a superset of US-ASCII (specifically, 
> > > > ANSI_X3.4-1968) for bytes in the range 0x09 - 0x0D, 0x20, 0x21, 
> > > > 0x22, 0x26, 0x27, 0x2C - 0x3F, 0x41 - 0x5A, and 0x61 - 0x7A, and, 
> > > > in addition, if that encoding isn't US-ASCII itself, then the 
> > > > encoding must be specified using a meta element with a charset 
> > > > attribute or a meta element in the Encoding declaration state.
> > > 
> > > Please add that encodings that aren't supersets of US-ASCII must 
> > > never appear in a meta even if the encoding was also unambiguously 
> > > given by Content-Type metadata or the BOM.
> > 
> > Why?
> 
> Because the meta will never work. It will never be useful, but being led 
> to believe that it would be may cause trouble.

Done.
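
As a rough illustration of the "superset of US-ASCII" test for the byte 
ranges quoted above, here is a sketch in Python. It assumes that Python's 
codec registry stands in for the set of encodings under consideration, 
and the names REQUIRED_BYTES and is_ascii_superset are made up for this 
example.

```python
# Byte values taken from the quoted spec text:
# 0x09-0x0D, 0x20, 0x21, 0x22, 0x26, 0x27, 0x2C-0x3F, 0x41-0x5A, 0x61-0x7A.
REQUIRED_BYTES = (
    list(range(0x09, 0x0E))
    + [0x20, 0x21, 0x22, 0x26, 0x27]
    + list(range(0x2C, 0x40))
    + list(range(0x41, 0x5B))
    + list(range(0x61, 0x7B))
)


def is_ascii_superset(codec_name: str) -> bool:
    """True if each required byte, decoded alone, maps to the same code point."""
    for byte in REQUIRED_BYTES:
        try:
            decoded = bytes([byte]).decode(codec_name)
        except UnicodeDecodeError:
            # e.g. UTF-16 cannot decode a lone byte at all.
            return False
        if decoded != chr(byte):
            return False
    return True
```

An unknown codec name raises LookupError here rather than returning 
False; whether that is the right behaviour is a design choice left open.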


> > > > Authors should not use JIS_X0212-1990, x-JIS0208, and encodings 
> > > > based on EBCDIC.
> > > 
> > > It would be nice to have a note about a good heuristic for detecting 
> > > whether an encoding is EBCDIC-based. Right now I'm using if not 
> > > ascii superset and starts with cp, ibm or x-ibm.
> > 
> > I agree that it would be nice; I've no idea what a good heuristic 
> > would be. I think a comprehensive list might be the only real 
> > solution.
> 
> I'd expect compatibility considerations in the design of EBCDIC variants 
> to have led to a heuristic being feasible, but I don't know what the 
> right heuristic is.

Not much I can do then. :-)
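
For reference, the stopgap heuristic Henri describes (not an ASCII 
superset, and the name starts with cp, ibm or x-ibm) can be written down 
as follows. This is a sketch of that heuristic only, not a reliable 
EBCDIC detector; the function name is hypothetical and the caller is 
assumed to supply the result of its own ASCII-superset test.

```python
def looks_ebcdic_based(name: str, ascii_superset: bool) -> bool:
    """Henri's heuristic: not an ASCII superset and a cp/ibm/x-ibm name prefix."""
    if ascii_superset:
        return False
    # str.startswith accepts a tuple of candidate prefixes.
    return name.strip().lower().startswith(("cp", "ibm", "x-ibm"))
```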


> > > > 8.2.2.4. Changing the encoding while parsing
> > > > 
> > > > When the parser requires the user agent to change the encoding, it 
> > > > must run the following steps.
> > > [...]
> > > > 1. If the new encoding is UTF-16, change it to UTF-8.
> > > 
> > > Please be specific about UTF-16BE, UTF-16LE, UTF-32, UTF-32BE and 
> > > UTF-32LE.
> > 
> > What about them?
> 
> Please state explicitly for each one if they MUST or MUST NOT change to 
> UTF-8.

I've explicitly shunned UTF-32 now, and will thus not mention it.

I've changed the text to say "a UTF-16 encoding", which is the language 
I've used elsewhere to basically cover anything based on UTF-16.
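
A minimal sketch of that first step, in Python. Which names count as "a 
UTF-16 encoding" is an assumption made for this example (the set below is 
illustrative, not taken from the spec), and the function name is 
hypothetical.

```python
# Assumed set of names treated as "a UTF-16 encoding" for this sketch.
UTF16_NAMES = {"utf-16", "utf-16be", "utf-16le"}


def adjust_new_encoding(name: str) -> str:
    """Step 1: if the new encoding is a UTF-16 encoding, change it to UTF-8."""
    if name.strip().lower() in UTF16_NAMES:
        return "utf-8"
    return name
```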


> > > Also, changing the encoding to something that is not a US-ASCII 
> > > superset should probably not happen (I haven't tested, though) and 
> > > should trigger an error.
> > 
> > I don't really understand what you mean. If a document is tentatively 
> > assumed to be some variant A of EBCDIC, but the <meta> later turns out 
> > to say that it is some variant B, why shouldn't we switch?
> 
> How would an EBCDIC variant ever be a tentative encoding considering 
> that it doesn't make sense for Web-oriented heuristic sniffers to ever 
> return an EBCDIC variant as the guess?

If that was the encoding the user last used for that file.


> It seems to me that *reasonable* tentative encodings are ASCII supersets 
> given Web reality.

Past user behaviour would trump that.


> > > > While the invocation of this algorithm is not a parse error, it is 
> > > > still indicative of non-conforming content.
> > > 
> > > This is a bit annoying from an implementor's point of view. It's a "left 
> > > as exercise to the reader" kind of note. That's not good for 
> > > interop, because the reader might fail the exercise.
> > 
> > Would you rather the note wasn't there? I don't really know what to 
> > point to from that note.
> 
> If the situation doesn't always indicate that the content is 
> non-conforming, yes, I'd prefer the note not to be there. If it always 
> indicates that the content is non-conforming, I'd like the note to 
> explain why.
> 
> Otherwise readers like me end up spending non-trivial time trying to 
> understand what the note is implying.

I've removed the note.


> > > Also, as it stands, the conformance of documents that aren't 
> > > ASCII-only and that don't have external encoding information or a 
> > > BOM is really fuzzy. Surely, it can't be right that a document that 
> > > is not encoded in windows-1252 but contains a couple of megabytes of 
> > > ASCII-only junk (mainly space characters) before the encoding meta 
> > > is conforming if there's no other fault. Surely, a document that 
> > > causes a reparse a couple of megabytes down the road is broken if 
> > > not downright malicious.
> > 
> > Why?
> 
> Because it kills performance but the author could always avoid killing 
> performance.
>
> > It is technically possible (and indeed, wise, as such pages are 
> > common) to implement a mechanism that can switch encodings on the fly. 
> > The cost need not be high.
> 
> I implemented on-the-fly decoder switching, but I removed the code 
> because it virtually never ran but keeping track of whether it still 
> could be run for a given stream incurred a performance penalty all the 
> time. The reason why on-the-fly decoder switching doesn't work is that 
> an efficient implementation converts a buffer of a couple of thousand 
> bytes/characters in one go, so non-ASCII characters in the body of the 
> document will have been misconverted by the time a meta is seen.
> 
> > Why would the encoding declaration coming after a multimegabyte 
> > comment be a problem?
> 
> Because tearing down the tree and reparsing multiple megabytes takes 
> excessive time compared to the sane case.
> 
> > > I'd still like to keep a notion of conformance that makes all 
> > > conforming HTML5 documents parseable with a truly streaming parser. 
> > > (True streaming may involve buffering n bytes up front with a 
> > > reasonable spec-set value of n but may not involve arbitrary 
> > > buffering depending on the input.)
> > > 
> > > Perhaps the old requirement of having the internal encoding meta 
> > > (the whole tag--not just the key attribute) within the first 512 
> > > bytes of the document wasn't such a bad idea after all, even though 
> > > it does make it harder to write a serializer that produces only 
> > > conforming output even when the input of the serializer is 
> > > malicious.
> > > 
> > > In that case, conformance could be defined so that a document is 
> > > non-conforming if changing the encoding while parsing actually 
> > > would change the encoding even when the prescan was run on exactly 
> > > 512 bytes.
> > 
> > That seems like a really weird conformance requirement.
> 
> It's a requirement to use to having the spec. :-)

So I entirely sympathise with what you're saying above, but I can't really 
bring myself to make this non-conforming just because it is not optimal. 
I mean, a multimegabyte comment is not going to be conducive to 
performance whatever we do, and we're not going to make that illegal 
either. I think the requirement that it be the first element in the <head> 
is enough.

Feel free to include a warning though.
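
To make the misconversion problem quoted above concrete: if a whole 
buffer is decoded with the tentative encoding before the meta is seen, 
any non-ASCII characters ahead of the meta have already been converted 
wrongly. The Python sketch below only illustrates that effect; the sample 
document and the function name are invented for the example.

```python
# A UTF-8 document whose encoding declaration comes after non-ASCII content.
document = '<p>caf\u00e9</p><meta charset="utf-8">'.encode("utf-8")


def decode_whole_buffer(buffer: bytes, encoding: str) -> str:
    """Convert the entire buffer in one go, as an efficient decoder would."""
    return buffer.decode(encoding)


# Tentative guess: the e-acute (bytes C3 A9) is decoded as two characters.
misconverted = decode_whole_buffer(document, "windows-1252")

# After the meta is seen, the only fix is to decode the bytes again.
reparsed = decode_whole_buffer(document, "utf-8")
```

The ASCII-only meta survives the wrong decode, which is why it can still 
be found, but the text before it cannot be repaired without going back to 
the original bytes.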


> > > >    Otherwise, if the element has a content attribute, and applying 
> > > > the algorithm for extracting an encoding from a Content-Type to 
> > > > its value returns a supported encoding encoding, and the 
> > > > confidence is currently tentative, then change the encoding to the 
> > > > encoding encoding.
> > > 
> > > What if the algorithm for extracting the encoding returns a value 
> > > but it isn't a supported encoding. Shouldn't that count as an error 
> > > of some kind?
> > 
> > Document conformance can't be dependent on the UA it is parsed by. UA 
> > conformance can't be dependent on the documents it parses.
> > 
> > I don't see what kind of error it could be.
> 
> This is a huge theoretical hole in any spec that allows an open-ended 
> selection of character encodings. I suppose this could be characterized 
> as an error that is in the same class as the network connection dropping 
> while the document is being read.
> 
> (A simpler way to deal with this would be making the set of permitted 
> encodings closed. After all, anything in the set in addition to UTF-8 is 
> just legacy extra.)

I recommend just saying you don't support the encoding when you run into 
this situation, and not otherwise worrying about it.
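
In code, that recommendation amounts to treating an unsupported name as 
if no usable declaration had been found. A Python sketch, assuming 
Python's codec registry plays the role of the UA's supported encodings 
and using hypothetical function names:

```python
import codecs


def is_supported(label: str) -> bool:
    """True if the codec registry knows the label; stands in for UA support."""
    try:
        codecs.lookup(label)
    except LookupError:
        return False
    return True


def encoding_after_meta(current: str, confidence: str, label: str) -> str:
    """Switch only if the confidence is tentative and the label is supported."""
    if confidence == "tentative" and is_supported(label):
        return label
    # Unsupported label (or certain confidence): carry on with the current one.
    return current
```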


On Fri, 23 May 2008, Henri Sivonen wrote:
> > 
> > If the page is detected as UTF-16, the odds of it being anything else 
> > are extremely low,
> 
> I found the situation with a real off-the-shelf detector to be 
> different.

That's unfortunate.


> Perhaps I'm being irrationally emotional here, but I think it's more 
> forgivable to make authoring mistakes with 8-bit legacy encodings than 
> with UTF-16, so I have far less sympathy for making bogus UTF-16 work. I 
> don't have real stats but I'd expect undeclared Cyrillic 8-bit content 
> to be much more common than BOMless UTF-16 content.
>
> > Well, the spec as it stands allows you to limit it to 
> > ASCII-superset-only if you want. However, I've heard from at least one 
> > vendor that they needed to detect UTF-16 (by looking for 00 3C 00 ?? 
> > and 3C 00 ?? 00 as the first four bytes; ?? != 00) to support some 
> > pages. I can't really see that heuristic being triggered by 
> > Windows-1251 pages.
> 
> That's sad, but if the Web requires it, perhaps the spec should mandate 
> that exact heuristic.

I wasn't entirely convinced by the need myself (I'm more in your camp of 
not seeing much UTF-16 at all), but my point is that some people want to 
support non-ASCII-compatible encodings there, and I don't see a good 
reason to prevent them from doing that. Again, that doesn't stop you from 
implementing such a restriction on your end. It's the official open-ended 
heuristic hatch for encoding detection. Do what you want. :-)
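
For anyone who does want the vendor heuristic mentioned above (00 3C 00 ?? 
or 3C 00 ?? 00 as the first four bytes, with ?? != 00), a Python sketch 
follows. The function name and the returned labels are assumptions made 
for the example; the spec does not mandate any of this.

```python
from typing import Optional


def sniff_bomless_utf16(data: bytes) -> Optional[str]:
    """Guess BOM-less UTF-16 from the first four bytes, else return None."""
    if len(data) < 4:
        return None
    b0, b1, b2, b3 = data[:4]
    # 00 3C 00 ??  ->  "<" followed by a non-NUL low byte, big-endian.
    if b0 == 0x00 and b1 == 0x3C and b2 == 0x00 and b3 != 0x00:
        return "utf-16be"
    # 3C 00 ?? 00  ->  the same pattern, little-endian.
    if b0 == 0x3C and b1 == 0x00 and b2 != 0x00 and b3 == 0x00:
        return "utf-16le"
    return None
```

A Windows-1251 page starting with "<html" has no 00 bytes in its first 
four bytes, so it does not match either pattern.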

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Saturday, 24 May 2008 10:46:47 UTC