
Re: [HTML5] 2.8 Character encodings

From: Anne van Kesteren <annevk@opera.com>
Date: Tue, 04 Aug 2009 11:12:09 +0200
To: "Dr. Olaf Hoffmann" <Dr.O.Hoffmann@gmx.de>, public-html-comments@w3.org
Message-ID: <op.ux4ruje764w2qv@anne-van-kesterens-macbook.local>
On Tue, 04 Aug 2009 10:25:35 +0200, Dr. Olaf Hoffmann  
<Dr.O.Hoffmann@gmx.de> wrote:
> [snip] I think, up to now, it has not even a version indication,
> therefore it is not obvious to me how to indicate, that a document is
> written in 'HTML5'.

This is by design. We're removing versioning from (X)HTML much like CSS  
does not have versioning. (To be clear, not everyone in the HTML WG agrees  
with this design choice.)

> But as already mentioned, for an author of an 'ISO-8859-1'-'HTML5'
> document apart from the version indication it is already a problem to
> specify the used encoding properly.

No, you can just specify it. Just like you can in HTML4.

> This problem appears while a document is written and has to be solved  
> before publication, therefore
> published documents are not broken, because they simply are not
> published due to this problem.

I do not follow this.

> Therefore if I start to write some test documents and this problem is
> not avoided and a version indication is possible, I think, I will use
> UTF-8 for those documents.

This seems like a good idea regardless.

> Typically this means, that they are
> incompatible with other of my documents and scripts and will appear
> in another directory with an Apache-.htaccess file indicating the
> different encoding.

That is one solution. You could also always indicate the encoding in the  
document instead and instruct Apache to not include the charset parameter.
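The reason the charset parameter matters is that an encoding sent in the HTTP Content-Type header takes precedence over one declared inside the document. A minimal sketch of that precedence in Python follows; the function names are made up for illustration, the meta scan is a rough regular expression rather than the real HTML5 prescan algorithm, and byte order marks are ignored entirely.

```python
import re
from email.message import Message

# Rough pattern for <meta charset=...> or <meta http-equiv ... charset=...>.
# This is an illustrative approximation, not the HTML5 prescan algorithm.
META_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()  # lowercased, None if absent


def charset_from_meta(body):
    """Return the charset declared in a meta element near the start, or None."""
    match = META_RE.search(body[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def effective_charset(content_type, body):
    """HTTP-level charset wins; the in-document declaration is the fallback."""
    return charset_from_content_type(content_type) or charset_from_meta(body)


# The server sends a charset: the meta element is overridden.
print(effective_charset("text/html; charset=ISO-8859-1",
                        b'<meta charset="utf-8">'))
# The server sends no charset: the meta element decides.
print(effective_charset("text/html", b'<meta charset="utf-8">'))
```

With the charset parameter removed from the server response, each document can declare its own encoding, so files with different encodings can live in the same directory.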

> I think, the Apache has an option with specific file name extensions too,
> this can be used for directories with mixed encodings maybe.

That is an option too. You can also set headers on a per-file basis using  
the Files directive.

> Surely I will not explain this to other authors, if this question comes  
> up, because it is too complex for many authors.

Agreed. Encoding is largely misunderstood. It makes more sense for editors  
to start defaulting to UTF-8 going forward and have everyone use that, in  
my opinion.

> This does not cause broken documents, the construct is just more fragile
> and one has to care more, where to put and how to name files and one
> has to switch the encoding in the editor for different projects. This is
> only more work and more sources of possible errors, not recommendable
> for every author.

If you simply switch to UTF-8 for all future work this will become less
and less of a problem. And then you've also covered other scripts should
the need arise to use them.
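To illustrate the difference, here is a small Python sketch; the sample strings and helper names are arbitrary choices for the example. It shows that the same text yields different bytes in the two encodings, that reading UTF-8 bytes as ISO-8859-1 garbles non-ASCII characters, and that ISO-8859-1 cannot represent text in other scripts at all.

```python
text = "café"

# The same text produces different bytes in each encoding.
latin1_bytes = text.encode("iso-8859-1")  # one byte for "é"
utf8_bytes = text.encode("utf-8")         # two bytes for "é"


def misread_utf8_as_latin1(s):
    """Simulate a UTF-8 document being decoded as ISO-8859-1."""
    return s.encode("utf-8").decode("iso-8859-1")


def can_encode(s, encoding):
    """Report whether every character of s exists in the given encoding."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(latin1_bytes, utf8_bytes)
print(misread_utf8_as_latin1(text))
print(can_encode("Ωμέγα", "iso-8859-1"), can_encode("Ωμέγα", "utf-8"))
```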

> Therefore maybe I will never create more than test documents for
> 'HTML5' just to avoid such complications.


> With the new microdata section, 'HTML5' seemed to get more
> interesting for authors (well, the CURIEs are still missing, but there
> seems to be a workaround with entity definitions within the else
> almost empty DOCTYPE), therefore it would have been interesting
> to test this or to include this in tutorials for other authors, because
> it has already a few more semantically relevant elements than

Since HTML5 is no longer SGML-based, entity definitions there will not work
and are non-conforming. The reason we did this was that, other than the
validator, no software processed text/html resources in this way, leading to
a lot of author confusion because of the clear mismatch between the
validator and other software.

Anne van Kesteren
Received on Tuesday, 4 August 2009 09:12:58 UTC
