Re: Details on internal encoding declarations from Henri Sivonen on 2008-05-23 (public-html@w3.org from May 2008)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Fri, 23 May 2008 12:10:43 +0300
To: Ian Hickson <ian@hixie.ch>
Cc: HTML WG <public-html@w3.org>
Message-Id: <95B9EEDC-2E5B-4652-96BD-631253D5D72A@iki.fi>
On May 23, 2008, at 01:29, Ian Hickson wrote:

> On Thu, 20 Mar 2008, Henri Sivonen wrote:
>>> 3.7.5.4. Specifying the document's character encoding
>>>
>>> A character encoding declaration is a mechanism by which the  
>>> character
>>> encoding used to store or transmit a document is specified.
>>>
>>> The following restrictions apply to character encoding declarations:
>>>
>>>    * The character encoding name given must be the name of the  
>>> character
>>> encoding used to serialise the file.
>>
>> Please add a note here explaining that as a consequence of the  
>> presence of the
>> UTF-8 BOM, the only permitted value for the internal encoding  
>> declaration is
>> UTF-8.
>
> I don't understand the request.

I meant adding a note like this:
"Note: If the first three bytes in the file form a UTF-8 BOM, the only  
permitted encoding name is 'UTF-8'."
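
A minimal sketch of the check such a note would imply for a conformance checker, assuming the declared name has already been extracted from the meta. The function name is hypothetical, and comparing case-insensitively is an assumption, not something the draft states here.

```python
import codecs


def declaration_conflicts_with_bom(data: bytes, declared: str) -> bool:
    """Return True if the file starts with a UTF-8 BOM but the internal
    declaration names something other than UTF-8."""
    if not data.startswith(codecs.BOM_UTF8):
        # No UTF-8 BOM, so this particular restriction does not apply.
        return False
    # Assumption: encoding names are compared case-insensitively.
    return declared.strip().lower() != "utf-8"
```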

>>>    * The value must be a valid character encoding name, and must  
>>> be the
>>> preferred name for that encoding. [IANACHARSET]
>>
>> As a practical matter, are conformance checkers expected to  
>> distinguish
>> between encoding names that cannot be encoding names due to form  
>> and encoding
>> names that could be encoding names but are unknown?
>
> There's no requirement to that effect. I don't see that it would make any
> particularly useful difference UI-wise.

OK. Good. Less code. :-)

>>> If the document does not start with a BOM, and if its encoding is  
>>> not
>>> explicitly given by Content-Type metadata, then the character  
>>> encoding
>>> used must be a superset of US-ASCII (specifically, ANSI_X3.4-1968)  
>>> for
>>> bytes in the range 0x09 - 0x0D, 0x20, 0x21, 0x22, 0x26, 0x27, 0x2C -
>>> 0x3F, 0x41 - 0x5A, and 0x61 - 0x7A , and, in addition, if that
>>> encoding isn't US-ASCII itself, then the encoding must be specified
>>> using a meta element with a charset attribute or a meta element in the
>>> Encoding declaration state.
>>
>> Please add that encodings that aren't supersets of US-ASCII must  
>> never
>> appear in a meta even if the encoding was also unambiguously given by
>> Content-Type metadata or the BOM.
>
> Why?

Because the meta will never work. It will never be useful, but being
led to believe that it is may cause trouble.
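
To illustrate why it can never work: a byte-level prescan looks for the ASCII bytes of "<meta", and those bytes are simply not there when the document is serialised in an encoding that isn't an ASCII superset. A rough sketch (the helper name is hypothetical, and the real prescan is case-insensitive and does more than a substring search):

```python
def meta_visible_to_prescan(markup: str, encoding: str) -> bool:
    """Return True if the ASCII byte sequence b'<meta' appears in the
    document as serialised in the given encoding."""
    return b"<meta" in markup.encode(encoding)


markup = '<meta charset="cp037">'
# An EBCDIC variant and UTF-16 both hide the tag from a byte-level scan.
ebcdic_visible = meta_visible_to_prescan(markup, "cp037")
utf16_visible = meta_visible_to_prescan(markup, "utf-16-le")
# An ASCII superset keeps the tag visible.
utf8_visible = meta_visible_to_prescan(markup, "utf-8")
```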

>>> Authors should not use JIS_X0212-1990, x-JIS0208, and encodings  
>>> based
>>> on EBCDIC.
>>
>> It would be nice to have a note about a good heuristic for detecting
>> whether an encoding is EBCDIC-based. Right now I'm using if not ascii
>> superset and starts with cp, ibm or x-ibm.
>
> I agree that it would be nice; I've no idea what a good heuristic  
> would
> be. I think a comprehensive list might be the only real solution.

I'd expect compatibility considerations in the design of EBCDIC
variants to have led to a heuristic being feasible, but I don't know
what the right heuristic is.
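
For reference, the stopgap heuristic quoted above ("not ascii superset and starts with cp, ibm or x-ibm") could be sketched as follows. This is not a known-good heuristic. It leans on whatever codecs the local Python installation ships, the function names are hypothetical, and the byte ranges are the ones from the quoted spec text.

```python
# Byte values the quoted spec text requires to match US-ASCII.
_RANGES = [(0x09, 0x0D), (0x20, 0x22), (0x26, 0x27),
           (0x2C, 0x3F), (0x41, 0x5A), (0x61, 0x7A)]
SPEC_BYTES = bytes(b for lo, hi in _RANGES for b in range(lo, hi + 1))


def is_ascii_superset(name: str) -> bool:
    """True if the codec decodes the listed bytes the same way US-ASCII does.
    Raises LookupError if the codec name is unknown to Python."""
    try:
        return SPEC_BYTES.decode(name) == SPEC_BYTES.decode("ascii")
    except UnicodeError:
        # The bytes are not even decodable, so it is not an ASCII superset.
        return False


def looks_ebcdic(name: str) -> bool:
    """Stopgap heuristic: name prefix plus the ASCII-superset test."""
    prefixed = name.lower().startswith(("cp", "ibm", "x-ibm"))
    return prefixed and not is_ascii_superset(name)
```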

>>> 8.2.2.4. Changing the encoding while parsing
>>>
>>> When the parser requires the user agent to change the encoding, it  
>>> must run
>>> the following steps.
>> [...]
>>> 1. If the new encoding is UTF-16, change it to UTF-8.
>>
>> Please be specific about UTF-16BE, UTF-16LE, UTF-32, UTF-32BE and  
>> UTF-32LE.
>
> What about them?

Please state explicitly for each one if they MUST or MUST NOT change  
to UTF-8.

>> Also, changing the encoding to something that is not a US-ASCII superset
>> should probably not happen (I haven't tested, though) and should trigger
>> an error.
>
> I don't really understand what you mean. If a document is tentatively
> assumed to be some variant A of EBCDIC, but the <meta> later turns  
> out to
> say that it is some variant B, why shouldn't we switch?

How would an EBCDIC variant ever be a tentative encoding considering  
that it doesn't make sense for Web-oriented heuristic sniffers to ever  
return an EBCDIC variant as the guess? It seems to me that  
*reasonable* tentative encodings are ASCII supersets given Web reality.

>>> While the invocation of this algorithm is not a parse error, it is
>>> still indicative of non-conforming content.
>>
>> This is a bit annoying from an implementor's point of view. It's a "left as
>> exercise to the reader" kind of note. That's not good for interop,
>> because the reader might fail the exercise.
>
> Would you rather the note wasn't there? I don't really know what to  
> point
> to from that note.

If the situation doesn't always indicate that the content is
non-conforming, yes, I'd prefer the note not to be there. If it always
indicates that the content is non-conforming, I'd like the note to
explain why.

Otherwise readers like me end up spending non-trivial time trying to  
understand what the note is implying.

>> Also, as it stands, the conformance of documents that aren't ASCII- 
>> only
>> and that don't have external encoding information or a BOM is really
>> fuzzy. Surely, it can't be right that a document that is not  
>> encoded in
>> windows-1252 but contains a couple of megabytes of ASCII-only junk
>> (mainly space characters) before the encoding meta is conforming if
>> there's no other fault. Surely, a document that causes a reparse a
>> couple of megabytes down the road is broken if not downright  
>> malicious.
>
> Why?

Because it kills performance but the author could always avoid killing  
performance.

> It is technically possible (and indeed, wise, as such pages are
> common) to implement a mechanism that can switch encodings on the  
> fly. The
> cost need not be high.

I implemented on-the-fly decoder switching, but I removed the code
because it virtually never ran, yet keeping track of whether it still
could be run for a given stream incurred a performance penalty all the
time. The reason why on-the-fly decoder switching doesn't work is that
an efficient implementation converts a buffer of a couple of
thousand bytes/characters in one go, so non-ASCII characters in the
body of the document will have been misconverted by the time a meta is
seen.
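
A small illustration of the effect, assuming a 2048-byte buffer and windows-1252 as the tentative encoding (both are assumptions picked for the example, as are the names):

```python
import codecs

CHUNK = 2048  # assumed buffer size: "a couple of thousand bytes"


def first_chunk(data: bytes, tentative: str) -> str:
    """Convert the first buffer in one go using the tentative encoding."""
    decoder = codecs.getincrementaldecoder(tentative)(errors="replace")
    return decoder.decode(data[:CHUNK])


# A UTF-8 document whose non-ASCII text precedes the meta.
doc = '<title>caf\u00e9</title><meta charset="utf-8">'.encode("utf-8")
text = first_chunk(doc, "windows-1252")
# The tokenizer only sees the meta after the whole buffer was converted,
# so the title has already been turned into mojibake by then.
```

Switching decoders at the point where the meta is seen cannot repair the already-converted characters. The buffer has to be decoded again from the raw bytes.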

> Why would the encoding declaration coming after a
> multimegabyte comment be a problem?

Because tearing down the tree and reparsing multiple megabytes
takes excessive time compared to the sane case.

>> I'd still like to keep a notion of conformance that makes all  
>> conforming
>> HTML5 documents parseable with a truly streaming parser. (True  
>> streaming
>> may involve buffering n bytes up front with a reasonable spec-set  
>> value
>> of n but may not involve arbitrary buffering depending on the input.)
>>
>> Perhaps the old requirement of having the internal encoding meta (the
>> whole tag--not just the key attribute) within the first 512 bytes  
>> of the
>> document wasn't such a bad idea after all, even though it does make  
>> it
>> harder to write a serializer that produces only conforming output  
>> even
>> when the input of the serializer is malicious.
>>
>> In that case, conformance could be defined so that a document is non-
>> conforming if changing the encoding while parsing actually would  
>> change
>> the encoding even when the prescan was run on exactly 512 bytes.
>
> That seems like a really weird conformance requirement.

It's a requirement to use to having the spec. :-)
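
A rough sketch of how a checker might test the proposed rule. It is heavily simplified: a regular expression stands in for the real prescan, comments and other markup are not handled, and the function names and the windows-1252 default for the tentative encoding are assumptions.

```python
import re
from typing import Optional

PRESCAN_LIMIT = 512  # the limit proposed above

# Simplified stand-in for the prescan: a complete meta tag carrying a
# charset, either as its own attribute or inside a content attribute.
META_RE = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9._:-]+)[^>]*>',
    re.IGNORECASE,
)


def declared_encoding(data: bytes) -> Optional[str]:
    """Return the lowercased encoding name of the first meta found, if any."""
    match = META_RE.search(data)
    return match.group(1).decode("ascii").lower() if match else None


def would_force_reparse(data: bytes, tentative: str = "windows-1252") -> bool:
    """True if the whole meta tag is not within the first 512 bytes and a
    later meta names an encoding different from the tentative one."""
    if declared_encoding(data[:PRESCAN_LIMIT]) is not None:
        return False  # the prescan already found the declaration
    late = declared_encoding(data)
    return late is not None and late != tentative.lower()
```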

>>>    Otherwise, if the element has a content attribute, and applying
>>> the algorithm for extracting an encoding from a Content-Type to its
>>> value returns a supported encoding encoding, and the confidence is
>>> currently tentative, then change the encoding to the encoding
>>> encoding.
>>
>> What if the algorithm for extracting the encoding returns a value  
>> but it
>> isn't a supported encoding. Shouldn't that count as an error of some
>> kind?
>
> Document conformance can't be dependent on the UA it is parsed by.
> UA conformance can't be dependent on the documents it parses.
>
> I don't see what kind of error it could be.

This is a huge theoretical hole in any spec that allows an open-ended  
selection of character encodings. I suppose this could be  
characterized as an error that is in the same class as the network  
connection dropping while the document is being read.

(A simpler way to deal with this would be making the set of permitted  
encodings closed. After all, anything in the set in addition to UTF-8  
is just legacy extra.)

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Friday, 23 May 2008 09:11:27 UTC