
Re: [HTML5] 2.8 Character encodings

From: Dr. Olaf Hoffmann <Dr.O.Hoffmann@gmx.de>
Date: Tue, 4 Aug 2009 12:17:49 +0200
To: public-html-comments@w3.org
Message-Id: <200908041217.49148.Dr.O.Hoffmann@gmx.de>

Anne van Kesteren:
> On Tue, 04 Aug 2009 10:25:35 +0200, Dr. Olaf Hoffmann
> <Dr.O.Hoffmann@gmx.de> wrote:
> > [snip] I think, up to now, it has not even a version indication,
> > therefore it
> > is not obvious to me how to indicate, that a document is written in
> > 'HTML5'.
> This is by design. We're removing versioning from (X)HTML much like CSS
> does not have versioning. (To be clear, not everyone in the HTML WG agrees
> with this design choice.)

Well, this is a problem for CSS too, because some properties are
defined differently in CSS2.1 than in CSS2. 
I discovered this some time ago for example for clipping for
some SVG test documents, which appeared wrong in Opera.
SVG depends on CSS2, therefore these tests are still well
defined, applied to (X)HTML they are not testable anymore,
because CSS has no version indication.

For 'HTML5' - as long as I cannot simply write version="HTML5"
I cannot start to write HTML5 documents. Already this is a
'show stopper' for 'HTML5' currently. One can still discuss the
current draft, but for formal reasons one cannot write a
'HTML5' document ;o)
There is no problem writing HTML3.2, HTML4, XHTML1.0,
XHTML1.1 or XHTML+RDFa, even if for some of them the
version indication is not very elegant and not very relevant
for typical user agents.

> > But as already mentioned, for an author of an 'ISO-8859-1'-'HTML5'
> > document apart from the version indication it is already a problem to
> > specify the used encoding properly.
> No, you can just specify it. Just like you can in HTML4.

I can write the string, but if I do, it is treated as 'Windows-1252'.
Therefore, effectively, I cannot indicate that something is
'ISO-8859-1' and not 'Windows-1252'.
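
To make the difference concrete, here is a minimal Python sketch. The
helper names are only illustrative, and Python's codec tables are
assumed to be a fair stand-in for what a user agent does. It lists the
byte values that decode differently under the two labels:

```python
def decode_or_none(raw: bytes, encoding: str):
    """Decode raw bytes, returning None where the encoding defines no character."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


def differing_bytes():
    """Return every byte value that ISO-8859-1 and Windows-1252 decode differently."""
    differing = []
    for value in range(256):
        raw = bytes([value])
        if decode_or_none(raw, "iso-8859-1") != decode_or_none(raw, "cp1252"):
            differing.append(value)
    return differing


DIFFERING = differing_bytes()
print(f"{len(DIFFERING)} bytes differ: 0x{DIFFERING[0]:02X}..0x{DIFFERING[-1]:02X}")
# Byte 0x80 is a C1 control character in ISO-8859-1 but the euro sign in Windows-1252.
print(repr(b"\x80".decode("iso-8859-1")), repr(b"\x80".decode("cp1252")))
```

Only the range 0x80 to 0x9F is affected, so a document that never uses
those bytes reads the same under either label. The problem is that the
label itself no longer says which one was meant.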

> > This problem appears while a document is written and has to be solved
> > before publication, therefore
> > published documents are not broken, because they simply are not
> > published due to this problem.
> I do not follow this.
> > Therefore if I start to write some test documents and this problem is
> > not avoided and a version indication is possible, I think, I will use
> > UTF-8 for those documents.
> This seems like a good idea regardless.

Sure, if you have no history with thousands of documents or scripts.

> > Typically this means, that they are
> > incompatible with other of my documents and scripts and will appear
> > in another directory with an Apache-.htaccess file indicating the
> > different encoding.
> That is one solution. You could also always indicate the encoding in the
> document instead and instruct Apache to not include the charset parameter.

Of course, the document should contain it too. However, on many servers
authors have no direct control over the Apache defaults. Therefore it is
always a good idea to ensure that this works independently of the whims
of the administrator.
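
One way to check a document against the server defaults is sketched
below in Python. It is only a rough illustration: the function names
are hypothetical, the regular expression is a simplification of how a
real parser finds the declaration, and the header value would have to
come from an actual HTTP response.

```python
import re
from email.message import Message

# Simplified pattern: matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)


def header_charset(content_type: str):
    """Extract the charset parameter from a Content-Type header value, if any."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def meta_charset(head: bytes):
    """Extract the charset declared inside the document itself, if any."""
    match = META_CHARSET.search(head)
    return match.group(1).decode("ascii").lower() if match else None


def charsets_conflict(content_type: str, head: bytes) -> bool:
    """True if server and document both declare an encoding and they disagree."""
    from_server = header_charset(content_type)
    from_document = meta_charset(head)
    return (
        from_server is not None
        and from_document is not None
        and from_server != from_document
    )


page = b'<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'
print(charsets_conflict("text/html; charset=UTF-8", page))
print(charsets_conflict("text/html", page))
```

The first call models a server default that overrides the document,
the second a server that sends no charset parameter at all.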

> > I think, the Apache has an option with specific file name extensions too,
> > this can be used for directories with mixed encodings maybe.
> That is an option too. You can also set headers on a per-file basis using
> the Files directive.
> > Surely I will not explain this to other authors, if this question comes
> > up, because it is too complex for many authors.
> Agreed. Encoding is largely misunderstood. It makes more sense for editors
> to start defaulting to UTF-8 going forward and have everyone use that, in
> my opinion.
> > This does not cause broken documents, the construct is just more fragile
> > and one has to care more, where to put and how to name files and one
> > has to switch the encoding in the editor for different projects. This is
> > only more work and more sources of possible errors, not recommendable
> > for every author.
> If you simply switch to UTF-8 for all future work this will become less
> and less of a problem. And then you've also covered other scripts should
> the need arise to use them.

For some projects, it may take several years, until I update them 
completely. On one server I still found HTML3.2 documents this
year ;o)
More often content is just added or minor bugs are fixed.
I think, this is the same for many authors having already 
thousands of documents around somewhere.

Of course if you are 12 to 15 years old, starting your first
project, you could start just from the beginning with a currently
proper choice. However, when you start, you don't understand,
what a proper choice for the future is. Especially for them we
have to keep things simple to get some better quality of 
documents in the future. Because 'HTML5' works off a lot
of historical relics and browser bugs, it is not a good
option for a simple start anyway.

> > Therefore maybe I will never create more than test documents for
> > 'HTML5' just to avoid such complications.
> Ok.
> > With the new microdata section, 'HTML5' seemed to get more
> > interesting for authors (well, the CURIEs are still missing, but there
> > seems to be a workaround with entity definitions within the otherwise
> > almost empty DOCTYPE), therefore it would have been interesting
> > to test this or to include this in tutorials for other authors, because
> > it has already a few more semantically relevant elements than
> > HTML4/XHTML1.x.

--- off topic ;o) ---

> Since HTML5 is no longer SGML-based, entity definitions there will not work
> and are non-conforming. The reason we did this was because other than the
> validator no software processed text/html resources in this way leading to
> a lot of author confusion because of the clear mismatch between the
> validator and other software.

SVG tiny 1.2 documents typically have no doctype, but for this purpose
it is still pretty useful, and an SVG/XML parser interprets it.
Because it is possible to use an XML parser for 'HTML5', it should
be possible to use this important feature too. Of course, because
it is XML, one can simply start to mix in elements from other
languages too, if this appears more convenient than using
microdata to indicate that 'HTML5' elements or their content
represent the same meaning as those elements from other
languages.
For this microdata information one mainly has to process it
(like CURIEs in XHTML+RDFa) if the semantic meaning becomes
relevant, not for presentation in a simple browser. For that one
maybe needs only the attribute values to identify the elements
for styling with CSS.
And for example, if these constructions refer to a human-readable
specification (as I have currently created for literature), it is not
obvious what a simple browser should do with this information.
This is the general problem with this approach of filling semantic
gaps of languages with such extensions. For presentation
in simple browsers an author cannot derive much more than
some styling. But if a language like (X)HTML has these gaps,
this is currently the only way to combine the technical functionality
of elements with detailed semantic meanings.
And because the technical functionalities of (X)HTML and SVG
are already largely interpreted in common browsers,
formats/versions like XHTML+RDFa, SVG tiny 1.2, 'HTML5'
are interesting for exploring methods of providing both technical
functionality and semantic meaning.
But if there are problems indicating the encoding or the
version of the format used, this is counterproductive for authors
wanting to create long-lived, well-defined and semantically
rich documents, not because there are specific presentation
problems in current browsers, but because one simply
cannot indicate what one is currently doing - and oneself
in ten years, or others even later, will have to identify this.

Received on Tuesday, 4 August 2009 11:10:10 GMT
