Re: Problem in publishing multilingual HTML document on web in UTF-8 encoding

From: David Woolley <david@djwhome.demon.co.uk>
Date: Thu, 1 Jun 2006 22:37:57 +0100 (BST)
Message-Id: <200606012137.k51LbvY01266@djwhome.demon.co.uk>
To: www-html@w3.org

> 
> http://thread.gmane.org/gmane.user-groups.linux.delhi/12845/focus=12845

The HTML 4.01 specification overrides the HTTP specification and says
that the character set is undefined, not ISO 8859/1, when none is
specified by other means.  It also allows browsers to use heuristics,
so one heuristic might be to assume ISO 8859/1!

In practice, servers don't honour meta elements.  However, browsers
are required to do so, for this special case, if there is no charset
in the real HTTP headers, so one gets the same result.  Consequently,
if this really were HTML, there would be no problem - in fact many high
profile sites use UTF-8 with only meta elements to specify it.
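
A minimal sketch of that precedence, in Python. The function name and
the regular expressions are illustrative assumptions; real browsers do
considerably more sniffing than this, and the 1024-byte prescan window
is an arbitrary choice here.

```python
import re

# Matches charset=... in an HTTP Content-Type header value.
HEADER_CHARSET = re.compile(r'charset\s*=\s*"?([A-Za-z0-9_.:-]+)', re.IGNORECASE)

# Matches charset=... anywhere inside a <meta ...> tag in the raw bytes.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def effective_charset(content_type_header, body):
    """Return the charset a browser would be expected to use, or None.

    A charset in the real HTTP header wins; the meta element is only
    consulted when the header carries none.
    """
    match = HEADER_CHARSET.search(content_type_header or "")
    if match:
        return match.group(1).lower()
    # Only look near the start of the document for the meta element.
    match = META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # Undefined: the browser is left to its own heuristics.
    return None


page = b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
print(effective_charset("text/html; charset=iso-8859-1", page))
print(effective_charset("text/html", page))
```

The first call shows a server-supplied charset beating the meta
element; the second shows the meta element being used when the header
is silent.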

However, in this case, you aren't using HTML, but XHTML.  In my view,
it is almost certain that you are doing so for unsound reasons, but
there are rules for the character set in XML and in fact the default
is already UTF-8!  However, it is likely that you are actually serving
to Internet Explorer, which doesn't support XHTML, so you've had to
serve it with headers that say that it is HTML.  In fact, your meta
element also says that it is HTML.  You therefore have a confused 
situation where you are relying on browser error recovery to treat
a document written in XHTML as though it were broken HTML.  I'd suggest
the first thing to do is to convert to HTML 4.01 to eliminate the
error recovery aspects.

You would get a problem if the server had the character set explicitly
set, but that is extremely rare, even in regions that require a
non-default setting and where authors normally fail to use the meta
route, relying on users to have their browser set to assume the local
character set.

However, in this case, that is exactly your problem!  If you want to
use this server, and you cannot convince them to remove the charset
from the headers, you will need to use entities to encode the
non-ISO 8859/1 characters.  Some authoring tools, such as Mozilla, will
allow you to save in different character sets and will automatically
do the required entity encoding.
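
For illustration, the same entity encoding can be sketched in Python
using the standard "xmlcharrefreplace" error handler. The function name
is made up for this example, and it assumes the text is being placed in
HTML content where numeric character references are allowed.

```python
import html


def to_latin1_with_references(text):
    """Encode text as ISO 8859/1 bytes.

    Characters that ISO 8859/1 can represent are emitted as single
    bytes; everything else becomes a numeric character reference
    such as &#2361;.
    """
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


sample = "caf\u00e9 \u0939\u093f\u0928\u094d\u0926\u0940"
encoded = to_latin1_with_references(sample)
print(encoded)

# A browser decoding the bytes as ISO 8859/1 and resolving the
# references recovers the original text.
assert html.unescape(encoded.decode("iso-8859-1")) == sample
```

The resulting bytes are valid under the charset the server announces,
so the page displays correctly without any change to the headers.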
 
> Sorry for any inconvenience, but I think I've found a bug in HTML
> specification (which might be prevalent in XHTML specifications also).
> Not necessarily a bug, but a correction that needs to be done in the
> HTML specification.

You've failed to specify what you think the problem is, so I've
had to try and analyze from the thread you referenced.

HTTP/1.0 200 OK
Age: 103
Date: Thu, 01 Jun 2006 21:27:47 GMT
Content-Length: 1490
Content-Type: text/html; charset=iso-8859-1
Server: Apache/1.3.34 (Debian) PHP/4.4.2-1+b1 mod_choke/0.06
X-mod-choke: 0.06
Last-Modified: Wed, 26 Apr 2006 22:56:06 GMT
ETag: "a808c-5d2-444ffa86"
Received on Thursday, 1 June 2006 21:38:09 GMT
