Re: Problem in publishing multilingual HTML document on web in UTF-8 encoding from David Woolley on 2006-06-02 (www-html@w3.org from June 2006)

From: David Woolley <david@djwhome.demon.co.uk>
Date: Fri, 2 Jun 2006 08:19:51 +0100 (BST)
To: www-html@w3.org
Message-Id: <200606020719.k527Jpt01479@djwhome.demon.co.uk>

> I suspect that the issue has less to do with publishing multilingual HTML
> documents on the web in UTF-8 than the infrastructure that is being used
> to achieve the task. I am aware of many companies that publish

Yes.  My eventual conclusion was that it was the web server configuration.
That configuration may well have been well intentioned, done to make it
slightly more likely that documents would be served with a defined character
set in the primary market area (given that many people hand-coding will
never include the charset parameter).

However, if that was the case, they have failed to consider
internationalisation issues, and they have failed to consider the large
number of documents authored with Microsoft tools that use windows-1252
characters coded as such, or older documents coded with windows-1252,
but lacking a charset specification.  (An explicit character set ought
to turn off auto-detection, although one could argue that the presence
of illegal characters (0x80 to 0x9F) could trigger a heuristic to
ignore the character set.)
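
To make the parenthetical concrete, here is a minimal sketch of such a
heuristic in Python.  It is only an illustration, not what any browser
actually does: the function names are hypothetical, and it assumes the
declared character set is ISO-8859-1, where 0x80 to 0x9F are C1 control
codes that should not appear in HTML.

```python
def has_c1_bytes(data: bytes) -> bool:
    """Return True if any byte falls in the range 0x80..0x9F."""
    return any(0x80 <= b <= 0x9F for b in data)


def choose_charset(data: bytes, declared: str) -> str:
    """Hypothetical heuristic: distrust a declared ISO-8859-1 label
    when the body contains bytes that are only printable in windows-1252.

    Other declared charsets are returned unchanged, because bytes in
    0x80..0x9F are perfectly legal in, for example, UTF-8.
    """
    latin1_names = {"iso-8859-1", "latin-1", "latin1"}
    if declared.lower() in latin1_names and has_c1_bytes(data):
        return "windows-1252"
    return declared


# Curly quotes as written by many Microsoft tools (0x93 and 0x94).
sample = b"\x93quoted\x94"
print(choose_charset(sample, "ISO-8859-1"))
```

Whether overriding an explicit label like this is a good idea is exactly
the point that can be argued either way.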

Incidentally, one of the main reasons why servers don't honour
meta http-equiv elements is that doing so would be a layering violation.
A server is about serving resources of all sorts and shouldn't need to
have internal knowledge of particular document languages.  This is even
more true of caching proxies, which is why it is pretty pointless to try
to control caching behaviour with meta http-equiv.
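
As a rough illustration of the layering, the sketch below (Python, with
a hypothetical function name and parameter) builds a Content-Type header
purely from the file name and a configured default, without ever opening
the resource.  It is an assumption about how a typical server is
structured, not a description of any particular one.

```python
import mimetypes
from typing import Optional


def content_type_for(path: str, default_charset: Optional[str] = None) -> str:
    """Build a Content-Type header value from metadata only.

    The document body is never read, so a meta http-equiv element inside
    it cannot influence the result.
    """
    media_type, _ = mimetypes.guess_type(path)
    if media_type is None:
        media_type = "application/octet-stream"
    # A charset parameter only makes sense for textual media types.
    if default_charset and media_type.startswith("text/"):
        return f"{media_type}; charset={default_charset}"
    return media_type


print(content_type_for("index.html", "ISO-8859-1"))
```

A server-wide default of this kind is what overrides the author's
intended encoding when the document itself is really in UTF-8.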

Unfortunately, there are commercial reasons why low cost web space
doesn't provide the ability to use that space properly, by configuring
metadata, and psychological reasons why people won't learn HTTP as
well as HTML, with the result that, instead of doing things properly,
people find workarounds.  The commercial reasons are partially to lower
the security risk to the server, and partially to encourage the purchase
of premium services.  In practice, rather than encouraging upgrades,
this results in workarounds.

> multilingual sites in UTF-8 that work fine with IE.

However, it is still a bad idea to use Appendix C XHTML unless you also
intend to serve it with a proper XHTML media type when talking to compatible
browsers AND it uses namespace mixing on those browsers.  Also, if one
does so, one should always specify the character set at both XML and
XHTML levels.  There's much more about the Appendix C mode issues in the
thread from February entitled "Question about XHTML 2.0 and content type".
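
For what "both levels" means in practice, here is a small Python sketch
that emits a page carrying the encoding in the XML declaration and in a
meta http-equiv element.  The function names and the page skeleton are
my own illustration; only the two declarations themselves are the point.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def xhtml_page(title: str, body: str, encoding: str = "UTF-8") -> bytes:
    """Return an XHTML 1.0 page that states its encoding twice:
    once for XML parsers and once for HTML (text/html) parsers."""
    doc = (
        f'<?xml version="1.0" encoding="{encoding}"?>\n'
        '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n'
        '  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
        '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\n'
        '<head>\n'
        '<meta http-equiv="Content-Type" '
        f'content="text/html; charset={encoding}" />\n'
        f'<title>{escape(title)}</title>\n'
        '</head>\n'
        f'<body><p>{escape(body)}</p></body>\n'
        '</html>\n'
    )
    # Encode with the same charset that both declarations announce.
    return doc.encode(encoding)


def root_tag(page: bytes) -> str:
    """Parse the page as XML and return the qualified root element name."""
    return ET.fromstring(page).tag


page = xhtml_page("Test", "caf\u00e9 \u65e5\u672c\u8a9e")
print(root_tag(page))
```

Neither declaration helps if the server sends a conflicting charset in
the HTTP header, since that takes precedence.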

> document...just in case the user never set that in the page. The
> autodetection has worked well for a number of years.

It was broken for a number of years (and may still be).  If you selected
it, the body of printed pages was always blank; you just got page headers.
This was, I seem to remember, acknowledged in the knowledge base.

Received on Friday, 2 June 2006 07:20:07 UTC