
[whatwg] Internal character encoding declaration

From: Ian Hickson <ian@hixie.ch>
Date: Fri, 10 Mar 2006 20:49:09 +0000 (UTC)
Message-ID: <Pine.LNX.4.62.0603102026120.315@dhalsim.dreamhost.com>
On Mon, 8 Aug 2005, Henri Sivonen wrote:
>
> Quoting from WA1 draft section 2.2.5.1. Specifying and establishing the
> document's character encoding:
> 
> > The meta element may also be used, in HTML only (not in XHTML) to provide
> > UAs with character encoding information for the file. To do this, the meta
> > element must be the first element in the head element,
> 
> To cater for implementations that consume the byte stream only once in all
> cases and do not rewind the input and restart the parser upon discovering the
> meta, [...]

I'm actually considering just requiring that UAs support rewinding (by 
defining the exact semantics of how to parse for the <meta> header). Is 
this something people would object to?
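For what it's worth, the prescan a UA might run over the initial bytes before committing to an encoding could look roughly like this. A simplified sketch only, in Python for concreteness: the function name is mine, the attribute handling is far looser than any spec text would be, and it assumes an ASCII-compatible byte stream:

```python
import re

def prescan_for_charset(data, limit=1024):
    """Scan the first `limit` bytes of an HTML byte stream for a
    charset declaration inside a <meta> tag.  Returns the declared
    encoding name lowercased, or None if none is found."""
    head = data[:limit]
    # Examine each <meta ...> tag (case-insensitively) for a
    # charset=... token anywhere in its attribute text.
    for tag in re.finditer(rb'<meta\b([^>]*)>', head, re.IGNORECASE):
        m = re.search(rb'charset\s*=\s*["\']?([A-Za-z0-9_-]+)',
                      tag.group(1), re.IGNORECASE)
        if m:
            return m.group(1).decode('ascii').lower()
    return None
```

If the prescan finds nothing (or guesses wrong), that is where the rewind-and-restart question bites: the parser has already consumed bytes under the wrong decoder.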


> I think it would be beneficial to additionally stipulate that
> 1. The meta element-based character encoding information declaration is
> expected to work only if the Basic Latin range of characters maps to the same
> bytes as in the US-ASCII encoding.

Is this realistic? I'm not really familiar enough with character encodings 
to say if this is what happens in general.
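One way to check Henri's point 1 for any given encoding is simply to encode the printable Basic Latin range and compare it byte-for-byte against US-ASCII. A quick sketch using Python's codec machinery (which encodings it knows about is implementation-specific, of course):

```python
def is_ascii_compatible(encoding):
    """True if the printable US-ASCII range (0x20-0x7E) maps to the
    same bytes under `encoding` as under US-ASCII itself."""
    ascii_range = bytes(range(0x20, 0x7f)).decode('ascii')
    try:
        return ascii_range.encode(encoding) == ascii_range.encode('ascii')
    except (LookupError, UnicodeEncodeError):
        return False
```

By this test UTF-8 and the 8859 family pass, while UTF-16 and EBCDIC code pages fail, which is exactly why a byte-level `<meta>` prescan can work at all for the former group.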


> 2. If there is no external character encoding information nor a BOM (see 
> below), there MUST NOT be any non-ASCII bytes in the document byte 
> stream before the end of the meta element that declares the character 
> encoding. (In practice this would ban unescaped non-ASCII class names on 
> the html and [head] elements and non-ASCII comments at the beginning of 
> the document.)

Again, can we realistically require this? I need to do some studies of 
non-Latin pages, I guess.
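A conformance checker enforcing that rule would essentially just look for the first byte >= 0x80 ahead of the end of the meta declaration. A hypothetical helper (the function name and the crude tag matching are mine, not from any spec):

```python
import re

def non_ascii_before_meta(data):
    """Return the offset of the first non-ASCII byte occurring before
    the end of the <meta> charset declaration, or None if there is
    none.  If no declaration is found, the whole stream is checked."""
    m = re.search(rb'<meta[^>]*charset[^>]*>', data, re.IGNORECASE)
    end = m.end() if m else len(data)
    for i, b in enumerate(data[:end]):
        if b >= 0x80:
            return i
    return None
```

This is what would flag, say, a UTF-8 comment like `<!-- café -->` placed ahead of the declaration.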


> > it must have the http-equiv attribute set to the literal value 
> > Content-Type,
> 
> I think case-insensitivity should be allowed in the string 
> "Content-Type", because there is legacy precedent for that and HTTP 
> defines header names as case-insensitive.
>
> > and must have the content attribute set to the literal value text/html;
> > charset=
> 
> That string should be case-insensitive as well, because HTTP defines it
> case-insensitive.

Yeah. I've made a note in the spec to that effect.



> Also, should zero or more white space characters be allowed before ';' 
> and around '=' and should the space after ';' be one or more white space 
> characters? HTTP-wise yes, but would it lead to real-world 
> incompatibilities? (I have not tested.)

Well, don't forget that the parsing side of this will be ridiculously 
more lax than the authoring side described here.
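To make the contrast concrete, here is what an HTTP-style lax match of the content attribute value might look like: case-insensitive, with optional whitespace around the ';' and '='. A sketch, not a proposal for the spec text:

```python
import re

# Accepts variants such as:  TEXT/HTML ; Charset = windows-1252
CONTENT_TYPE_RE = re.compile(
    r'^\s*text/html\s*;\s*charset\s*=\s*["\']?([A-Za-z0-9_-]+)',
    re.IGNORECASE)

def charset_from_content(value):
    """Extract the charset token from a content attribute value,
    tolerating case differences and stray whitespace; None if the
    value is not a text/html; charset=... declaration."""
    m = CONTENT_TYPE_RE.match(value)
    return m.group(1).lower() if m else None
```

The authoring requirement would stay strict (exactly `text/html; charset=`), while something like the above describes what UAs actually accept.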


> > Authors should avoid including inline character encoding information. 
> > Character encoding information should instead be included at the 
> > transport level (e.g. using the HTTP Content-Type header).
> 
> I disagree.
> 
> With HTML with contemporary UAs, there is no real harm in including the 
> character encoding information both on the HTTP level and in the meta as 
> long as the information is not contradictory. On the contrary, the 
> author-provided internal information is actually useful when end users 
> save pages to disk using UAs that do not reserialize with internal 
> character encoding information.

...and it breaks everything when you have a transcoding proxy or similar. 
Character encoding information shouldn't be duplicated, IMHO; that's just 
asking for trouble.

I suppose the spec could stay silent on this though.


> > For HTML, user agents must use the following algorithm in determining the
> > character encoding of a document:
> > 1. If the transport layer specifies an encoding, use that.
> 
> Shouldn't there be a BOM-sniffing step here? (UTF-16 and UTF-8 only; UTF-32
> makes no practical sense for interchange on the Web.)

I don't know, should there?
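If such a step were added, it would be trivial; the three byte-order marks in question are fixed prefixes. A hypothetical sketch of what step 1.5 might do:

```python
def sniff_bom(data):
    """Hypothetical BOM-sniffing step: return (encoding, bom_length),
    or (None, 0) if no recognised BOM is present.  Only UTF-8 and
    UTF-16 are considered, per Henri's suggestion."""
    if data.startswith(b'\xef\xbb\xbf'):
        return 'utf-8', 3
    if data.startswith(b'\xfe\xff'):
        return 'utf-16-be', 2
    if data.startswith(b'\xff\xfe'):
        return 'utf-16-le', 2
    return None, 0
```

Note the UTF-16 checks must come in that order relative to nothing else, but UTF-8's three-byte BOM has to be tested before any two-byte prefix scheme that could alias it.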


> > 2. Otherwise, if the user agent can find a meta element that specifies
> > character encoding information (as described above), then use that.
> 
> If a conformance checker has not determined the character encoding by 
> now, what should it do? Should it report the document as non-conforming 
> (my preferred choice)? Should it default to US-ASCII and report any 
> non-ASCII bytes as conformance errors? Should it continue to the fuzzier 
> steps like browsers would (hopefully not)?

Again, I don't know. This entire section needs dramatically more text, 
which I don't really feel fully qualified to write.


> > 4. Otherwise, use an implementation-defined or user-specified default 
> > character encoding (ISO-8859-1, windows-1252, and UTF-8 are 
> > recommended as defaults, and can in many cases be identified by 
> > inspection as they have different ranges of valid bytes).
> 
> I think it does not make sense to recommend ISO-8859-1, because 
> windows-1252 is always a better guess in practice. In the context of 
> HTML, UTF-8 looks like a weird default considering years of precedent 
> with the de facto windows-1252 default. (Of course, if the UA is willing 
> to examine the entire byte stream before parsing, UTF-8 can be detected 
> very reliably.)

All three can be detected fairly reliably, given a large enough sample: 
UTF-8 has a distinctive multi-byte structure, and windows-1252 uses the 
0x80-0x9F range that ISO-8859-1 reserves for (rarely used) control codes. 
We could just default to win1252 though, indeed.
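The detection-by-inspection idea can be sketched in a few lines. This is only a rough illustration of the heuristic, not a serious detector; in particular, pure-ASCII input comes back as UTF-8, which is harmless since the bytes decode identically under all three:

```python
def guess_encoding(data):
    """Guess among UTF-8, windows-1252, and ISO-8859-1 by inspecting
    the byte stream.  UTF-8 is checked by strict decoding; otherwise
    windows-1252 is preferred when bytes in 0x80-0x9F appear, since
    ISO-8859-1 reserves that range for C1 control codes."""
    try:
        data.decode('utf-8')
        return 'utf-8'
    except UnicodeDecodeError:
        pass
    if any(0x80 <= b <= 0x9f for b in data):
        return 'windows-1252'
    return 'iso-8859-1'
```

In practice the windows-1252 branch fires on things like smart quotes (0x93/0x94), which is precisely why it is the better default guess.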


Currently the behaviour is very underspecified here:

   http://whatwg.org/specs/web-apps/current-work/#documentEncoding

I'd like to rewrite that bit. It will require a lot of research: into 
existing authoring practices, current UAs, and author needs. If anyone 
wants to step up and do the work, I'd be very happy to work with them 
and get something sorted out here.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Friday, 10 March 2006 12:49:09 UTC
