Re: [HTML5] 2.8 Character encodings

On Tue, 04 Aug 2009 12:17:49 +0200, Dr. Olaf Hoffmann  
<Dr.O.Hoffmann@gmx.de> wrote:
> Well, this is a problem for CSS too, because some properties are
> defined differently in CSS2.1 than in CSS2.

Things change. This is not necessarily a problem in my experience.


> I discovered this some time ago for example for clipping for
> some SVG test documents, which appeared wrong in Opera.
> SVG depends on CSS2, therefore these tests are still well
> defined, applied to (X)HTML they are not testable anymore,
> because CSS has no version indication.

This is one way of looking at the problem. Another way of looking at the  
problem is that specifications cannot incrementally evolve like  
implementations do and are therefore not always accurate in what they say.  
Just like normal languages, Web languages change now and then.


> For 'HTML5' - as long as I cannot simply write version="HTML5"
> I cannot start to write HTML5 documents.

In effect your documents will be treated as HTML5 regardless.


> Already this is a 'show stopper' for 'HTML5' currently. One can still
> discuss the current draft, but for formal reasons one cannot write a
> 'HTML5' document ;o)

As far as I know there are no formal reasons why one cannot write HTML5  
documents and publish them and in fact many people are authoring HTML5  
documents and publishing them.


> There is no problem to write HTML3.2, HTML4, XHTML1.0,
> XHTML1.1 or XHTML+RDFa, even if for some of them the
> version indication is not very elegant and not very relevant
> for typical user agents.

This assumes versioning is necessary. Experience with Web browsers shows  
that it is not needed, and doing without it avoids a lot of complexity.


>> No, you can just specify it. Just like you can in HTML4.
>
> I can write the string, but indeed, if I do it, it means 'Windows-1252'.

Not for your authoring tool or a conformance checker.


> Therefore effectively, I cannot indicate, that something is
> 'ISO-8859-1' and not 'Windows-1252'.

You cannot indicate that something needs to be decoded as ISO-8859-1 by  
Web browsers for text/html content. This has been the case for a long time  
and is nothing new.
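
To make the difference concrete, here is a minimal Python sketch (not taken  
from the HTML5 draft). The helper name browser_decode and its tiny alias set  
are assumptions for illustration; real browsers recognise more labels. The  
two encodings only disagree for bytes 0x80 to 0x9F:

```python
# Labels that this sketch treats as Windows-1252 for text/html.
# Assumption: a deliberately small, illustrative alias set.
LATIN1_LABELS = {"iso-8859-1", "latin1"}


def browser_decode(data: bytes, label: str) -> str:
    """Hypothetical helper: decode the way a browser would for text/html."""
    normalized = label.strip().lower()
    if normalized in LATIN1_LABELS:
        # The declared label is kept by the author, but decoding uses
        # Windows-1252 instead.
        normalized = "windows-1252"
    return data.decode(normalized)


sample = b"\x93quoted\x94 costs 5 \x80"

# Strict ISO-8859-1: bytes 0x80-0x9F become C1 control characters.
strict = sample.decode("iso-8859-1")

# Browser-style: the same bytes become curly quotes and a euro sign.
browser = browser_decode(sample, "ISO-8859-1")

print(repr(strict))
print(repr(browser))
```

An authoring tool or conformance checker is free to use the strict reading,  
which is why the same label can mean two things depending on the consumer.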


>>> Therefore if I start to write some test documents and this problem is
>>> not avoided and a version indication is possible, I think, I will use
>>> UTF-8 for those documents.
>>
>> This seems like a good idea regardless.
>
> Sure, if you have no history with thousands of documents or scripts.

More and more programming languages work with Unicode internally. What  
scripts would act up?


>>> Typically this means, that they are
>>> incompatible with other of my documents and scripts and will appear
>>> in another directory with an Apache-.htaccess file indicating the
>>> different encoding.
>>
>> That is one solution. You could also always indicate the encoding in the
>> document instead and instruct Apache to not include the charset  
>> parameter.
>
> Of course, the document should contain it too. However on many servers
> authors have no direct control over the Apache defaults. Therefore it is
> always a good idea to ensure, that this works independently from gags
> of the administrator.

That is not what I'm saying. What I'm saying is that you could instruct  
Apache to not include the charset parameter, so you do not have to maintain  
a document/charset mapping within Apache, only within the documents.
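
The reason this helps: for text/html, a charset parameter on the HTTP  
Content-Type header takes precedence over the declaration inside the  
document. A rough Python sketch of that precedence follows. The function  
names are assumptions for illustration, and the real algorithm has more  
steps (a byte order mark, sniffing). On Apache, "AddDefaultCharset Off" in  
.htaccess is one way to drop the parameter, assuming the host permits that  
override.

```python
from email.message import Message
from typing import Optional


def http_charset(content_type: str) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type value, if any."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def effective_encoding(content_type: str,
                       meta_charset: Optional[str]) -> Optional[str]:
    """Hypothetical helper: the HTTP parameter wins, else the in-document label."""
    from_http = http_charset(content_type)
    if from_http is not None:
        return from_http
    return meta_charset.lower() if meta_charset else None


# Server adds a charset: the document's own declaration is overridden.
print(effective_encoding("text/html; charset=ISO-8859-1", "UTF-8"))

# Server sends no charset: the document's declaration is used.
print(effective_encoding("text/html", "UTF-8"))
```

With the parameter gone, each document carries its own encoding label and  
nothing has to be kept in sync on the server side.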


>> If you simply switch to UTF-8 for all future work this will become less
>> and less of a problem. And then you've also covered other scripts,
>> should the need arise to use them.
>
> For some projects, it may take several years, until I update them
> completely. On one server I still found HTML3.2 documents this
> year ;o)

Yeah, this is nothing new :-) Lots of legacy content out there.


> More often content is just added or minor bugs are fixed.
> I think, this is the same for many authors having already
> thousands of documents around somewhere.

If you are just fixing minor bugs I do not see what HTML5 has to do with  
this.


> Of course if you are 12 to 15 years old, starting your first
> project, you could start just from the beginning with a currently
> proper choice. However, when you start, you don't understand,
> what a proper choice for the future is. Especially for them we
> have to keep things simple to get some better quality of
> documents in the future. Because 'HTML5' works off a lot
> of historical relics and browser bugs, it is not a good
> option for a simple start anyway.

HTML5 also removes a lot of historical relics. E.g. SGML. This simplifies  
things _a lot_ for authors in my opinion and experience in talking with  
Web developers about this. (And feedback I get from colleagues who talk to  
Web developers on a near full-time basis.)


> --- off topic ;o) ---
>
>> Since HTML5 is no longer SGML-based, entity definitions there will not
>> work and are non-conforming. The reason we did this was because other
>> than the validator no software processed text/html resources in this
>> way, leading to a lot of author confusion because of the clear mismatch
>> between the validator and other software.
>
> SVG tiny 1.2 documents have typically no doctype, but for this purpose,
> it is still pretty useful and an SVG/XML-parser interprets this.
> Because for 'HTML5' it is possible to use an XML-parser, it should
> be possible to use this important feature too. Of course, because
> it is XML, one can simply start to mix with elements from other
> languages too, if this appears to be more convenient as to use
> microdata to indicate 'HTML5' elements or their content to
> represent the same meaning as those elements from other
> namespaces.

Ah. I did not realize you were talking about XHTML5 (HTML5 expressed in  
XML). In XHTML5 ISO-8859-1 just means ISO-8859-1, not Windows-1252. That  
is just done for text/html documents. You can indeed use DOCTYPE features  
in XHTML5 although I believe it is currently recommended that you do not  
use them.


-- 
Anne van Kesteren
http://annevankesteren.nl/

Received on Tuesday, 4 August 2009 11:37:34 UTC