Re: Encoding interaction of HTTP response header and META tag

From: Jukka K. Korpela <jukka.k.korpela@kolumbus.fi>
Date: Thu, 3 Mar 2011 23:13:08 +0200
Message-ID: <FD5750C4E13C41518890F07D023D30C2@JukanPC>
To: <public-html-comments@w3.org>
Wayne Pollock wrote:

> If document authors
> goes to the trouble of stating the charset in the HEAD of their
> document,
> that that should override any default set by the web sever.

I have much sympathy for the idea, for the reasons you gave, especially the 
reason that web server admins often disallow the effects of .htaccess files, 
effectively enforcing their settings on every author.

However, I'm afraid it's too late; the change would break a long tradition 
and would break existing pages.

> It is a huge
> burden to webmasters everywhere to have to manually set the charset
> for every update to their website.

I can't see what you mean by that. The settings need to be checked when you 
start creating a site, not after every update.

> TO OVERRIDE THE DEFAULT CHARSET RETURNED BY APACHE, A PER FILE
> DIRECTIVE MUST
> BE USED TO SPECIFY EACH FILE'S CHARSET.

Pardon? Apache settings operate per filename extension, and mostly it 
suffices to set the encoding for just one extension, ".html".

> Such overriding is possible but
> to allow web authors the ability to do so, per directory settings
> must be  enabled (the ".htaccess" files).  doing so severely impacts 
> server
> performance

I don't think it has any significant impact on performance.

> and many sites simply can't do so, so web pages WILL be send with the
> wrong charset.

Well, I would put it this way: If the server admin disallows the effects of 
your .htaccess file, then it's just something you need to live with. If 
they force your HTML documents to be served with headers saying that the 
encoding is iso-8859-1, or utf-8, or whatever, then just make it so: save 
your documents in that encoding.

> This should be a simple fix.  The issue was raised on the WHATWG list
> and elsewhere, and noboby could think of an objection to this
> proposal.

I think a more specific citation of previous discussions would be needed.

> (The
> only web pages that could "break" with this change were already
> broken.)

99% of web pages are broken, in the sense of not complying with HTML, CSS, 
WCAG 1.0, or other relevant recommendations. When we worry about what 
happens to existing pages, we need to worry about more or less broken pages, 
mostly.

Consider a page on a server that forces Content-Type: text/html; 
charset=utf-8 on all HTML files. Such servers are increasingly common. 
Authors have had to adapt to that, for example by saving documents in 
utf-8 encoding if needed. The pages may well have <meta> tags announcing 
iso-8859-1 or something else, maybe because some web page editing software 
emitted it, or it belonged to a sample file used as a starting point, or the 
author copied it from somewhere, with little or no understanding of its 
effect.

Your proposal, if accepted and implemented, would mean that all such pages 
stop working if they contain (literally) any character outside the ASCII 
range. This might mean a mess that everyone can see, or just one character 
might be wrong, or anything between these extremes.
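
To make the breakage concrete, here is a minimal Python sketch. It assumes a 
page saved as utf-8 and served with a utf-8 header, but carrying a leftover 
meta tag that announces iso-8859-1. The function name decode_page, the 
meta_wins flag and the utf-8 fallback are invented for illustration; they 
only model which declaration takes precedence, not real browser behavior.

```python
def decode_page(body, header_charset, meta_charset, meta_wins=False):
    """Decode body with the HTTP header charset, or with the meta charset
    when meta_wins is set. Hypothetical helper, not a browser algorithm."""
    if meta_wins and meta_charset:
        charset = meta_charset
    else:
        # The utf-8 fallback is an assumption made for this sketch only.
        charset = header_charset or meta_charset or "utf-8"
    return body.decode(charset)


# The author saved the file as utf-8 because the server forces charset=utf-8.
body = "<p>café</p>".encode("utf-8")

# Current rule: the HTTP header overrides the stale meta declaration.
current = decode_page(body, "utf-8", "iso-8859-1")

# Proposed rule: the meta declaration overrides the HTTP header.
proposed = decode_page(body, "utf-8", "iso-8859-1", meta_wins=True)

print(current)   # <p>café</p>
print(proposed)  # <p>cafÃ©</p>
```

A page containing only ASCII characters decodes identically either way, 
which is why only pages with non-ASCII characters would be affected.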

> On a related note, the new structural tags that denote articles and
> such
> should allow an optional CHARSET attribute.  A web page with ARTICLEs
> etc. may be (and may likely be) composed of content from many sources,
> e.g., a "mash-up".  While CMS and blogging software could force a
> single
> charset so there is only one charset per web page,  that seems an
> unnecessary restriction (and I don't know that most blogging software
> works that way).

No, it's an inherent restriction. The idea of allowing different character 
encodings within a single document has often been suggested, but it's based 
on a misunderstanding. Changing the encoding at a higher protocol level 
conflicts with the basic modern model of using character data. Recognizing 
encoding from meta tags is admittedly in conflict with it, too, but it was a 
more or less unavoidable exception, which has been separately defined (and 
is still known to cause problems, especially when people don't understand 
how it works and place it too late in the document). A "mash-up" simply 
needs to recode its content when needed.
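
As a rough sketch of such recoding, assuming the mash-up software knows the 
encoding of each source: decode every fragment to characters, then encode 
everything in the one encoding the page is served in. The recode function 
and the sample fragments are hypothetical.

```python
def recode(fragment, source_encoding, target_encoding="utf-8"):
    """Convert bytes from source_encoding to target_encoding.

    Hypothetical helper: decode to characters first, then re-encode.
    """
    return fragment.decode(source_encoding).encode(target_encoding)


# Sample fragments arriving in different encodings (made up for illustration).
fragments = [
    ("Grüße".encode("iso-8859-1"), "iso-8859-1"),
    ("Привет".encode("koi8-r"), "koi8-r"),
    ("hello".encode("ascii"), "ascii"),
]

# The assembled page uses a single encoding throughout.
page = b"\n".join(recode(data, encoding) for data, encoding in fragments)

print(page.decode("utf-8"))
```

The result is one document in one encoding, so a single charset declaration 
covers all of it.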

-- 
Yucca, http://www.cs.tut.fi/~jkorpela/ 
Received on Thursday, 3 March 2011 21:14:03 GMT
