Re: japanese encoding nightmare from Daniel Barclay on 2006-11-16 (public-evangelist@w3.org from November 2006)

From: Daniel Barclay <daniel@fgm.com>
Date: Thu, 16 Nov 2006 14:13:38 -0500
To: Mike Schinkel <mikeschinkel@gmail.com>, public-evangelist@w3.org
Message-ID: <455CB862.9030202@fgm.com>
Mike Schinkel wrote:
> Daniel Barclay wrote: 
> 
>>> Remember that <META HTTP-EQUIV="..." ...> elements are not supposed

I should narrow that to "some ... elements."

>>> to be read by the browser when the browser retrieved the document
>>> from a server. 
>>> Such META elements are for the server to read and use to construct
>>> real HTTP header fields (if the server chooses that mechanism).
> 
> I recently read (from what I remember to be an authoritative source) that in
> practice servers rarely ever read them because of performance so the browser
> has to. 

In some cases, the browser is not even allowed to use them.

If the server indicates the content type and character encoding
("charset") in the HTTP response, the browser must use _that_ type and
charset and must _not_ use values from a <META HTTP-EQUIV="Content-Type"
...> element or anything else in the returned entity (document) to
determine the type and charset.  That is, the server's HTTP headers
override any specifications inside the entity.
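
As a rough sketch of that precedence (not taken from any specification or
browser; the function names, the 1024-byte scan limit, and the fallback
default are assumptions made up for illustration), a client might pick the
charset like this:

```python
import re

def charset_from_content_type(value):
    """Return the charset parameter of a Content-Type value, or None."""
    if not value:
        return None
    match = re.search(r'charset\s*=\s*"?([^\s;"]+)', value, re.IGNORECASE)
    return match.group(1).lower() if match else None

def charset_from_meta(body_bytes):
    """Look for <META HTTP-EQUIV="Content-Type" CONTENT="..."> near the start.

    Simplified: assumes http-equiv comes before content and that the
    markup itself is ASCII-compatible.
    """
    pattern = (rb'<meta\s+http-equiv\s*=\s*["\']?content-type["\']?'
               rb'\s+content\s*=\s*["\']([^"\']*)["\']')
    match = re.search(pattern, body_bytes[:1024], re.IGNORECASE)
    if not match:
        return None
    return charset_from_content_type(match.group(1).decode("ascii", "replace"))

def effective_charset(header_value, body_bytes, default="iso-8859-1"):
    """The HTTP header wins; META is consulted only if the header is silent.

    The default is a hypothetical fallback, not a recommendation.
    """
    return (charset_from_content_type(header_value)
            or charset_from_meta(body_bytes)
            or default)

body = b'<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=EUC-JP">'
print(effective_charset("text/html; charset=Shift_JIS", body))
print(effective_charset("text/html", body))
```

The first call takes the charset from the header even though the document
claims EUC-JP; the second falls back to the META element only because the
header carries no charset.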

A server is supposed to be able to change the encoding of a document as
long as it reports the encoding correctly in the Content-Type header.
It is not supposed to have to change any <META HTTP-EQUIV="Content-Type"
...> elements.

(Besides requiring any transcoding server to understand HTML, changing
such elements would be changing the _contents_ of the document, not just
changing its _encoding_ (changing the sequence of characters, not just
changing the bytes that encode the characters).)

If the browser ignored the Content-Type header from the server and read
a <META HTTP-EQUIV="Content-Type" ...> element, it might be trying to
use the wrong encoding.
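
To make that concrete, here is a small sketch of a transcoding scenario.
The document text and the choice of EUC-JP and UTF-8 are assumptions picked
for illustration; the point is only that the characters stay the same while
the bytes change, so the META element ends up describing the wrong encoding:

```python
# The document as a sequence of characters; its META element claims EUC-JP.
text = ('<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=EUC-JP">'
        '<p>\u65e5\u672c\u8a9e</p>')

# Bytes as stored on disk, matching what the META element says.
stored = text.encode("euc_jp")

# A (hypothetical) server transcodes to UTF-8 and reports that in the header,
# without touching the document's contents.
sent = stored.decode("euc_jp").encode("utf-8")
header = "text/html; charset=utf-8"

# A browser that trusts the header recovers the original characters.
right = sent.decode("utf-8")

# A browser that ignores the header and trusts the META element does not.
wrong = sent.decode("euc_jp", errors="replace")

print(right == text)
print(wrong == text)
```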


I thought that any browser that behaved differently (say, IE 6,
which sometimes ignores "text/plain" from the server) violated some
specification.

However, looking at the HTML 4.01 specification, I only see wording
about servers' being allowed to read such elements:
- "HTTP servers use this attribute to gather information for HTTP
   response message headers"
- "HTTP servers may use the property name specified by the http-equiv
   attribute to create an [RFC822]-style header in the HTTP response."

Evidently my source was something else.  I don't remember which
document it was, so I don't know whether it was as authoritative as
a specification.  (I do think it was something from the W3C.)


Note that XML has a similar rule regarding the character encoding
specified inside an XML document in the XML declaration ("<?xml
encoding='...'?>").  If the character encoding is specified to the
XML processor at a higher level (e.g., via an HTTP Content-Type
header), then the processor must ignore the character encoding
specification in the XML declaration.
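
A minimal sketch of the rule as described here (it illustrates the
description above, not confirmed wording from the XML specification; the
function names and the UTF-8 default are assumptions, and the declaration
sniffing only works for ASCII-compatible encodings):

```python
import re

def declared_xml_encoding(data):
    """Return the encoding named in the XML declaration, or None."""
    pattern = rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
    match = re.match(pattern, data)
    return match.group(1).decode("ascii").lower() if match else None

def xml_encoding(external_charset, data, default="utf-8"):
    """Prefer the externally supplied charset over the XML declaration."""
    if external_charset:
        return external_charset.lower()
    return declared_xml_encoding(data) or default

doc = b"<?xml version='1.0' encoding='EUC-JP'?><r/>"
print(xml_encoding("Shift_JIS", doc))  # external information is used
print(xml_encoding(None, doc))         # only the declaration is available
```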

(Again, I can't find that in the XML specification itself, so I
can't currently vouch for the authoritativeness of my source.)


Of course, that's all about the content type and encoding.  Since I
don't recall my source, I can't say whether most HTTP-EQUIV elements
are like Content-Type (the browser must _not_ use them) or not (the
browser can use them).


> This http://www.w3.org/TR/html4/struct/global.html#adef-http-equiv says
> (emphasis mine): "HTTP servers *MAY* use the property name specified by the
> http-equiv attribute to create an [RFC822]-style header in the HTTP
> response."  That would imply they might not, and if so the browser would
> have to handle, no?

Not quite.

It's not a server's not reading HTTP-EQUIV information from inside an
HTML document that might imply that the browser should read it.

If the server read more-authoritative information from elsewhere (e.g.,
a server configuration file describing the documents to be served out)
and reported it in an HTTP header, then the browser should not ignore
its more-authoritative source (the server HTTP response header) and
instead read a less-authoritative source (the insides of the document).


However, it might be a server's not sending a header at all that implies
that the browser can (or maybe should) use HTTP-EQUIV information.

(I'm not sure that there's not a case where the server can choose to not
return a certain header and where the browser should take that lack of
a header as authoritative.)



Daniel
Received on Thursday, 16 November 2006 19:14:06 UTC