Re: [HTML5] 2.8 Character encodings from Dr. Olaf Hoffmann on 2009-08-04 (public-html-comments@w3.org from August 2009)

From: Dr. Olaf Hoffmann <Dr.O.Hoffmann@gmx.de>
Date: Tue, 4 Aug 2009 15:45:36 +0200
To: public-html-comments@w3.org
Message-Id: <200908041545.37002.Dr.O.Hoffmann@gmx.de>
Julian Reschke:
> Dr. Olaf Hoffmann wrote:
> > ...
> > I can write the string, but indeed, if I do it, it means 'Windows-1252'.
> > Therefore effectively, I cannot indicate, that something is
> > 'ISO-8859-1' and not 'Windows-1252'.
> > ...
>
> Olaf, from what you write it's not totally clear that you realize that
> ISO-8859-1 is a proper subset of Windows-1252?
>

That seems to fit what Wikipedia, for example, notes for
ISO/IEC 8859-1, but not for ISO-8859-1; the two differ in
some control characters.
ISO-8859-1 and Windows-1252 therefore both seem to be
supersets of ISO/IEC 8859-1, but ISO-8859-1 is not a
subset of Windows-1252. The conflicting characters are,
however, typically not used in documents with a correctly
indicated ISO-8859-1 encoding.
http://en.wikipedia.org/wiki/ISO/IEC_8859-1.
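
To make the difference concrete, here is a minimal Python sketch.
It assumes that Python's "iso-8859-1" and "cp1252" codecs are
acceptable stand-ins for the two encodings discussed here; the
sample bytes and variable names are only illustrative.

```python
# Bytes 0x80-0x9F are the range where the two encodings disagree.
sample = bytes([0x80, 0x93, 0x94])

as_latin1 = sample.decode("iso-8859-1")  # C1 control characters
as_cp1252 = sample.decode("cp1252")      # euro sign and curly quotes

print([f"U+{ord(c):04X}" for c in as_latin1])
print([f"U+{ord(c):04X}" for c in as_cp1252])

# Outside that range (printable ASCII and 0xA0-0xFF) both codecs agree.
shared = bytes(range(0x20, 0x7F)) + bytes(range(0xA0, 0x100))
same_outside_c1 = shared.decode("iso-8859-1") == shared.decode("cp1252")
print(same_outside_c1)
```

The same three bytes decode to control characters under one label
and to presentable characters under the other, while everything
outside 0x80-0x9F is identical.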

What I personally use in (X)HTML is typically the ASCII
subset, where I do not even have to care about the
differences between 'ISO-8859-1' and 'UTF-8' if a server
administrator (or a browser implementor) decides something
surprising.
However, masking (escaping) special characters such as
umlauts and the ß ligature in the (X)HTML style is not
always possible; for an XML parser such as Opera's, for
example, it seems to depend on the XHTML version/doctype
whether the predefined (X)HTML entities are known. In such
cases I have to switch to more critical things.
Many others have relied on unmasked special characters
for many years. If they want to use and indicate
'ISO-8859-1', this should be possible. It is also no
problem if they want to use and indicate 'Windows-1252',
just as it is no problem to use and indicate 'UTF-8' - but
mixing these up in a specification basically means
confusion for some authors reading it, especially for
those who already have problems properly indicating the
encoding they used.
So what browsers currently do when 'ISO-8859-1' is
specified is not a big practical problem (provided
'ISO-8859-1' is actually used). The problem is rather that
readers of the draft are confused by the wording.
The draft could note something like: "If 'ISO-8859-1' etc.
is indicated as a document encoding, an HTML5 parser
will/may use 'Windows-1252' to decode the presentable
characters."
This would already be less confusing and does not change
the meaning of the string or the document; it only
describes the behaviour of the parser, which is a difference.
For more advanced authors one might add something
like "Due to this rule, wrong encoding information may go
unreported. Some non-presentable control characters of
'ISO-8859-1' might be presented as presentable characters
according to 'Windows-1252'."
This perhaps indicates the intended reason for this
behaviour and also indicates that such parsers should
not be used to check proper encoding/decoding
(which many authors nevertheless do, with
known consequences ;o)
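
The suggested wording can be sketched in Python as follows. This
is only an illustration under stated assumptions, not what any
draft or browser actually implements: the function names, the
small label set and the sample page are hypothetical, and bytes
that Python's "cp1252" codec leaves undefined are simply passed
through as the corresponding control characters.

```python
# Illustrative subset of labels treated as 'ISO-8859-1' (an assumption).
LATIN1_LABELS = {"iso-8859-1", "iso8859-1", "latin1", "l1"}

def decode_like_described_parser(data, declared):
    """Decode as the suggested wording describes: an 'ISO-8859-1'
    label is decoded with Windows-1252."""
    if declared.strip().lower() in LATIN1_LABELS:
        # Bytes undefined in Python's cp1252 fall back to chr(b),
        # i.e. the control character ISO-8859-1 assigns to them.
        return "".join(
            bytes([b]).decode("cp1252", errors="ignore") or chr(b)
            for b in data
        )
    return data.decode(declared)

def c1_offsets(data):
    """Offsets of bytes in 0x80-0x9F, which a strict ISO-8859-1
    check would read as C1 control codes."""
    return [i for i, b in enumerate(data) if 0x80 <= b <= 0x9F]

# A page labelled 'ISO-8859-1' that really contains Windows-1252 quotes.
page = "Gr\u00fc\u00dfe ".encode("iso-8859-1") + b"\x93quoted\x94"

shown = decode_like_described_parser(page, "ISO-8859-1")
suspect = c1_offsets(page)

print(ascii(shown))
print(suspect)
```

The first function hides the wrong label from the reader of the
page; only a separate check like the second one would reveal it,
which is why such a parser is a poor tool for checking encodings.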


> So the only difference would be the ability to diagnose problems in
> documents that claim to be ISO-8859-1, but actually use C1 control codes.
>
> That being said, I do agree with Larry that the spec should phrase it
> differently.
>
> BR, Julian
Received on Tuesday, 4 August 2009 15:12:47 UTC