Re: Details on internal encoding declarations

On Thu, 20 Mar 2008, Henri Sivonen wrote:
> > Specifying the document's character encoding
> > 
> > A character encoding declaration is a mechanism by which the character
> > encoding used to store or transmit a document is specified.
> > 
> > The following restrictions apply to character encoding declarations:
> > 
> >     * The character encoding name given must be the name of the character
> > encoding used to serialise the file.
> Please add a note here explaining that as a consequence of the presence of the
> UTF-8 BOM, the only permitted value for the internal encoding declaration is
> UTF-8.

I don't understand the request.
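
If the request is that a checker flag a non-UTF-8 internal declaration in
a file that starts with the UTF-8 BOM, that would be a check along these
lines (illustrative Python sketch, not spec text):

    def declaration_consistent_with_bom(doc_bytes, declared_name):
        # A file that begins with the UTF-8 BOM (EF BB BF) is
        # necessarily UTF-8, so an internal declaration naming
        # any other encoding would be lying.
        if doc_bytes.startswith(b'\xef\xbb\xbf'):
            return declared_name.lower() == 'utf-8'
        return True  # no BOM: this particular rule imposes nothing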

> >     * The value must be a valid character encoding name, and must be the
> > preferred name for that encoding. [IANACHARSET]
> As a practical matter, are conformance checkers expected to distinguish
> between encoding names that cannot be encoding names due to form and encoding
> names that could be encoding names but are unknown?

There's no requirement to that effect. I don't see that it would make any 
particularly useful difference UI-wise.
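
If a checker did want to distinguish the two, the obvious split is between
names that violate the charset-name syntax outright and names that are
well-formed but unregistered. A minimal Python sketch, where
KNOWN_ENCODINGS is a stand-in for the IANA registry:

    import re

    # Charset names per RFC 2978: 1-40 characters from a restricted set.
    CHARSET_NAME = re.compile(r"^[A-Za-z0-9!#$%&'+\-^_`{}~]{1,40}$")
    KNOWN_ENCODINGS = {'utf-8', 'windows-1252', 'shift_jis'}  # stand-in

    def classify(name):
        if not CHARSET_NAME.match(name):
            return 'malformed'   # cannot be an encoding name at all
        if name.lower() not in KNOWN_ENCODINGS:
            return 'unknown'     # plausible name, but unrecognised
        return 'ok'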

> >     * The encoding name must be serialised without the use of character
> > entity references or character escapes of any kind.
> I assume this layer-violating requirement is here in order to keep the prescan
> simple. In that case, the requirement should not apply only to the encoding
> name but to the entire attribute value containing the encoding name.
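
A quick illustration of the problem (the helper is a toy stand-in for the
spec's extraction algorithm; the point is just that a byte-level prescan
never expands character references anywhere in the attribute value):

    def naive_extract(value):
        # Toy stand-in, operating on raw bytes the way the prescan does.
        i = value.lower().find(b'charset')
        if i == -1:
            return None
        rest = value[i + len(b'charset'):].lstrip()
        if not rest.startswith(b'='):
            return None
        return rest[1:].strip().strip(b'"\'') or None

    naive_extract(b'text/html; charset=utf-8')      # b'utf-8'
    naive_extract(b'text/html; charset&#61;utf-8')  # None: an escape
                                                    # outside the name
                                                    # defeats the prescan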


> > If the document does not start with a BOM, and if its encoding is not 
> > explicitly given by Content-Type metadata, then the character encoding 
> > used must be a superset of US-ASCII (specifically, ANSI_X3.4-1968) for 
> > bytes in the range 0x09 - 0x0D, 0x20, 0x21, 0x22, 0x26, 0x27, 0x2C - 
> > 0x3F, 0x41 - 0x5A, and 0x61 - 0x7A, and, in addition, if that 
> > encoding isn't US-ASCII itself, then the encoding must be specified 
> > using a meta element with a charset attribute or a meta element in the 
> > Encoding declaration state.
> Please add that encodings that aren't supersets of US-ASCII must never 
> appear in a meta even if the encoding was also unambiguously given by 
> Content-Type metadata or the BOM.
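
The superset requirement itself is mechanically checkable; a minimal
Python sketch that probes a decoder over exactly the quoted byte ranges
(using whatever codecs the checker happens to ship):

    import codecs

    # The byte values the quoted text requires to behave as in US-ASCII.
    PROBE = (list(range(0x09, 0x0E)) + [0x20, 0x21, 0x22, 0x26, 0x27]
             + list(range(0x2C, 0x40)) + list(range(0x41, 0x5B))
             + list(range(0x61, 0x7B)))

    def ascii_compatible(encoding):
        try:
            codecs.lookup(encoding)
        except LookupError:
            return False
        for b in PROBE:
            try:
                if bytes([b]).decode(encoding) != chr(b):
                    return False
            except UnicodeDecodeError:
                return False
        return True

    ascii_compatible('windows-1252')  # True
    ascii_compatible('cp037')         # False: EBCDIC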


> > Authors should not use JIS_X0212-1990, x-JIS0208, and encodings based 
> > on EBCDIC.
> It would be nice to have a note about a good heuristic for detecting 
> whether an encoding is EBCDIC-based. Right now I'm using if not ascii 
> superset and starts with cp, ibm or x-ibm.

I agree that it would be nice; I've no idea what a good heuristic would 
be. I think a comprehensive list might be the only real solution.
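
For the record, the heuristic Henri describes comes out as something like
this in Python (ascii_compatible() as sketched earlier; no claim that the
prefix list is complete):

    def looks_ebcdic(name):
        # Henri's heuristic: not an ASCII superset, and the name
        # starts with "cp", "ibm" or "x-ibm".
        n = name.lower()
        return (not ascii_compatible(n)
                and n.startswith(('cp', 'ibm', 'x-ibm')))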

> [...]
> > Authors are encouraged to use UTF-8. Conformance checkers may advise 
> > against authors using legacy encodings.
> It might be good to have a note about the badness with form submission 
> and not-quite-IRI processing that non-UTF-8 encodings cause.

I'm not really sure what such a note would consist of. Could you send a 
separate e-mail on this topic?
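
Presumably the note would show something like the following: the same user
input produces different bytes on the wire depending on the page encoding,
so links and form submissions are not interoperable across encodings
(Python illustration):

    from urllib.parse import quote

    text = '\u00e9'  # "é" typed into a form or pasted into a URL
    quote(text.encode('utf-8'))          # '%C3%A9'
    quote(text.encode('windows-1252'))   # '%E9' -- a different resource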

> > Changing the encoding while parsing
> > 
> > When the parser requires the user agent to change the encoding, it must run
> > the following steps.
> [...]
> > 1. If the new encoding is UTF-16, change it to UTF-8.
> Please be specific about UTF-16BE, UTF-16LE, UTF-32, UTF-32BE and UTF-32LE.

What about them?
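
If the question is what a literal reading of step 1 does with them, it is
roughly this (sketch; the commented branch is precisely what Henri is
asking the spec to pin down):

    def step_1(new_encoding):
        n = new_encoding.lower()
        if n == 'utf-16':
            return 'utf-8'   # the quoted step, read literally
        if n in ('utf-16be', 'utf-16le', 'utf-32', 'utf-32be', 'utf-32le'):
            pass             # unspecified as quoted: Henri's question
        return new_encoding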

> Also, changing the encoding to something that is not a US-ASCII superset
> should probably not happen (I haven't tested, though) and trigger an error.

I don't really understand what you mean. If a document is tentatively 
assumed to be some variant A of EBCDIC, but the <meta> later turns out to 
say that it is some variant B, why shouldn't we switch?

> > While the invocation of this algorithm is not a parse error, it is 
> > still indicative of non-conforming content.
> This is a bit annoying from an implementor's point of view. It's a "left as 
> exercise to the reader" kind of note. That's not good for interop, 
> because the reader might fail the exercise.

Would you rather the note wasn't there? I don't really know what to point 
to from that note.

> Also, as it stands, the conformance of documents that aren't ASCII-only 
> and that don't have external encoding information or a BOM is really 
> fuzzy. Surely, it can't be right that a document that is not encoded in 
> windows-1252 but contains a couple of megabytes of ASCII-only junk 
> (mainly space characters) before the encoding meta is conforming if 
> there's no other fault. Surely, a document that causes a reparse a 
> couple of megabytes down the road is broken if not downright malicious.

Why? It is technically possible (and indeed, wise, as such pages are 
common) to implement a mechanism that can switch encodings on the fly. The 
cost need not be high. Why would the encoding declaration coming after a 
multimegabyte comment be a problem?
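
By "on the fly" I mean something like the following: if every byte
consumed so far is ASCII (the common case, since everything before the
<meta> in such pages is markup and spaces), the already-emitted characters
are identical under both encodings and the decoder can simply be swapped,
with a reparse only as the fallback. Rough Python sketch, assuming both
encodings are ASCII supersets:

    import codecs

    class SwitchableDecoder:
        def __init__(self, encoding):
            self.decoder = codecs.getincrementaldecoder(encoding)('replace')
            self.seen_non_ascii = False

        def feed(self, data):
            if any(b >= 0x80 for b in data):
                self.seen_non_ascii = True
            return self.decoder.decode(data)

        def switch(self, new_encoding):
            # Cheap path: an all-ASCII prefix decodes identically under
            # any ASCII-superset encoding, so just swap decoders.
            if self.seen_non_ascii:
                return False  # caller falls back to reparsing
            self.decoder = codecs.getincrementaldecoder(new_encoding)('replace')
            return True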

> I'd still like to keep a notion of conformance that makes all conforming 
> HTML5 documents parseable with a truly streaming parser. (True streaming 
> may involve buffering n bytes up front with a reasonable spec-set value 
> of n but may not involve arbitrary buffering depending on the input.)
> Perhaps the old requirement of having the internal encoding meta (the 
> whole tag--not just the key attribute) within the first 512 bytes of the 
> document wasn't such a bad idea after all, even though it does make it 
> harder to write a serializer that produces only conforming output even 
> when the input of the serializer is malicious.
> In that case, conformance could be defined so that a document is
> non-conforming if changing the encoding while parsing actually would change
> the encoding even when the prescan was run on exactly 512 bytes.

That seems like a really weird conformance requirement.
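
As I understand it, the proposed check would amount to this (prescan() is
a stand-in for the spec's prescan algorithm):

    def conforming_per_proposal(doc_bytes, meta_declared_encoding):
        # Non-conforming iff a prescan of exactly the first 512 bytes
        # would fail to find the encoding that the real <meta> declares,
        # i.e. iff "change the encoding" would actually change it.
        return prescan(doc_bytes[:512]) == meta_declared_encoding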

> > A start tag whose tag name is "meta"
> > 
> >     Insert an HTML element for the token. Immediately pop the current 
> > node off the stack of open elements.
> > 
> >     If the element has a charset attribute, and its value is a 
> > supported encoding, and the confidence is currently tentative, then 
> > change the encoding to the encoding given by the value of the charset 
> > attribute.
> Even though it's not strictly necessary to say it here, it would be 
> helpful to put a note here saying that if the confidence is already 
> certain and the meta-declared encoding after resolving aliases and magic 
> superset promotions does not equal the encoding in use or returns an 
> encoding that isn't a superset of US-ASCII, this is a parse error.

It's not a parse error; it's an error in the encoding declaration as given 
in that section. (The spec is long enough already that I don't want to 
start making it longer by saying things twice.)
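
For orientation, the quoted tree-construction step comes out as roughly
the following; insert_html_element(), is_supported_encoding() and
change_encoding() are stand-ins for the spec's machinery:

    def handle_meta_start_tag(token, parser):
        insert_html_element(token)          # insert, then immediately pop
        parser.open_elements.pop()
        charset = token.attributes.get('charset')
        if (charset is not None
                and is_supported_encoding(charset)
                and parser.confidence == 'tentative'):
            parser.change_encoding(charset)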

> >     Otherwise, if the element has a content attribute, and applying 
> > the algorithm for extracting an encoding from a Content-Type to its 
> > value returns a supported encoding _encoding_, and the confidence is 
> > currently tentative, then change the encoding to the encoding 
> > _encoding_.
> What if the algorithm for extracting the encoding returns a value but it 
> isn't a supported encoding? Shouldn't that count as an error of some 
> kind?

Document conformance can't be dependent on the UA it is parsed by.
UA conformance can't be dependent on the documents it parses.

I don't see what kind of error it could be.
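
For reference, the extraction algorithm reduces to roughly this
(simplified Python sketch, not a substitute for the spec text; the real
algorithm also resumes searching after false "charset" matches):

    def extract_from_content_type(value):
        lower = value.lower()
        i = lower.find('charset')
        if i == -1:
            return None
        j = i + len('charset')
        while j < len(value) and value[j].isspace():
            j += 1
        if j == len(value) or value[j] != '=':
            return None
        j += 1
        while j < len(value) and value[j].isspace():
            j += 1
        if j < len(value) and value[j] in '"\'':
            end = value.find(value[j], j + 1)
            return value[j + 1:end] if end != -1 else None
        end = j
        while end < len(value) and not value[end].isspace() and value[end] != ';':
            end += 1
        return value[j:end] or None

    extract_from_content_type('text/html; charset=utf-8')  # 'utf-8'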

Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Thursday, 22 May 2008 22:29:47 UTC