
Re: Auto-detect and encodings in HTML5

From: Leif Halvard Silli <lhs@malform.no>
Date: Fri, 12 Jun 2009 20:57:04 +0200
Message-ID: <4A32A500.20401@malform.no>
To: Jirka Kosek <jirka@kosek.cz>
CC: Ian Hickson <ian@hixie.ch>, public-html@w3.org
Jirka Kosek On 09-06-12 16.19:
> Leif Halvard Silli wrote:
> 
>> Thus, if a file has the name "file.html.utf8", then UAs should, when
>> reading that file via the file URL protocol give precedence to the
>> encoding expressed by the file suffix.
>>
>> Thus, I would suggest that HTML 5 a) specifies the file suffixes for all
>> the encodings that it endorses

> I don't think that this is a good idea:
> 
> First, on the majority of systems, files ending with suffixes like
> .utf8 will simply not be opened in a web browser at all.

Both .utf8.html and .html.utf8 must of course work - as in Apache.

> Second, filename is too fragile to convey any reasonable metadata about
> content. User can easily change extension and this will affect
> interpretation of encoded text.

Moot: the meta element has the same "problems". But if invisible 
metadata is a problem, then suffixes are safer & better. And no 
matter what, the reality is that the MIME type wins. So it would 
be nice to be able to mimic the MIME type via the file:// protocol.
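
As a rough sketch of what mimicking the MIME type from a file name could look like, here is a small Python example. The suffix tables and the function name are hypothetical, loosely modelled on the kind of mapping Apache's AddType and AddCharset directives set up; they are not anything HTML 5 specifies. The suffixes are read in any order, so both .utf8.html and .html.utf8 give the same result.

```python
from pathlib import PurePosixPath

# Hypothetical suffix tables, for illustration only. A real list would be
# whatever the specification (or the server configuration) chose to define.
TYPE_SUFFIXES = {
    ".html": "text/html",
    ".xhtml": "application/xhtml+xml",
    ".css": "text/css",
}
CHARSET_SUFFIXES = {
    ".utf8": "utf-8",
    ".koi8-r": "koi8-r",
    ".cp1251": "windows-1251",
}


def guess_from_suffixes(filename):
    """Return (media_type, charset) derived from the file suffixes.

    The suffixes may appear in any order; an unknown suffix is ignored,
    and a missing one yields None for that part of the result.
    """
    media_type = None
    charset = None
    for suffix in PurePosixPath(filename).suffixes:
        suffix = suffix.lower()
        if suffix in TYPE_SUFFIXES:
            media_type = TYPE_SUFFIXES[suffix]
        elif suffix in CHARSET_SUFFIXES:
            charset = CHARSET_SUFFIXES[suffix]
    return media_type, charset
```

With these assumed tables, "file.html.utf8" and "file.utf8.html" both map to text/html with charset utf-8, while a plain "file.html" yields no charset, leaving the UA to fall back on whatever default it already uses.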

I recommend that you test the effect of '.xhtml' and '.html' in 
Firefox. To this end, I provide a minimal, valid XHTML document 
with Cyrillic text - please save it in an 8-bit Cyrillic encoding.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
	"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ru">
<head><title></title></head><body><p>алфабета</p></body></html>

You will see that with the '.html' suffix the page (probably) 
defaults to the 8-bit encoding that dominates in your locale, 
whereas with '.xhtml' the page defaults to utf-8. You will also 
see that if you e.g. add a "<br>", then with '.xhtml' you get the 
'yellow screen of death', whereas there is no problem with that if 
the file suffix is '.html'. A clear demo of the difference between 
HTML and XHTML, and of how little the in-document metadata may 
matter.
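
For anyone who wants to set up this test without fiddling with an editor's save dialog, the following Python sketch writes the document above, byte for byte the same, under both suffixes. The choice of koi8-r, the file stem "test" and the function name are assumptions made for the example; any 8-bit Cyrillic encoding that Python knows (e.g. windows-1251) should do.

```python
from pathlib import Path

# The minimal XHTML test document from the message, with no encoding
# declaration of any kind inside it.
XHTML_DOC = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"\n'
    '\t"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ru">\n'
    '<head><title></title></head><body><p>алфабета</p></body></html>\n'
)


def write_test_files(directory, encoding="koi8-r", stem="test"):
    """Write identical bytes as <stem>.html and <stem>.xhtml.

    Returns the two paths, so they can be opened via file:// in a browser.
    """
    data = XHTML_DOC.encode(encoding)  # 8-bit Cyrillic bytes, not UTF-8
    paths = []
    for suffix in (".html", ".xhtml"):
        path = Path(directory) / (stem + suffix)
        path.write_bytes(data)
        paths.append(path)
    return paths
```

Calling write_test_files(".") and opening the two resulting files in the browser gives the comparison described: the bytes are the same, only the suffix differs.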

Is this "too fragile", as you say? It seems that file suffixes 
work well for HTML and XHTML documents.

What I learn from this is that the encoding must be given in a 
format that works across document types. And file suffixes are 
such a format - they work e.g. with CSS as well.

> Third, there is already widely used mechanism for conveying encoding
> information inside HTML body using <meta charset="..."> (in HTML5) and
> <meta http-equiv="content-type" content="text/html;charset=..."> (in
> "legacy" HTML). Sure this can collide with HTTP headers, but this
> problem is well known and web-masters are somehow trained to cope with it.

Charset suffixes have long been widely enough used.

But to bridge the gap between offline and online page serving, 
UAs would need to support them via the file:// protocol as well.
-- 
leif halvard silli
Received on Friday, 12 June 2009 18:57:45 UTC
