[whatwg] ISO-8859-* and the C1 control range

From: Øistein E. Andersen <html5@xn--istein-9xa.com>
Date: Sat, 02 Jun 2007 14:58:23 +0200
Message-ID: <E1HuTBX-000HUN-CT@node1-5.ouvaton.local>

On 29 May 2007, at 11:13AM, Henri Sivonen wrote:
> Surely there are other ISO-8859 family encodings besides ISO-8859-1
> that require decoding using the corresponding Windows-* family decoder?

For the following reasons, this is not entirely obvious:

1) Several of the Windows-* encodings are more or less incompatible with the
corresponding ISO-8859-* encoding (i.e., they are not supersets of it);
2) Only ISO-8859-1 enjoyed a privileged position as the standard HTML encoding;
3) Windows-1252 was registered with IANA later than the other Windows-*
encodings.
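
Point 1 can be checked mechanically. The following is a minimal sketch, assuming
that Python's bundled codec tables are a fair stand-in for the encodings under
discussion; the function names are hypothetical, and the ISO/Windows pairing
passed in is the caller's assumption. It only looks at bytes 0xA0-0xFF, since
the C1 range differs by design.

```python
def decode_or_none(byte_value, encoding):
    """Decode a single byte, or return None if the codec leaves it undefined."""
    try:
        return bytes([byte_value]).decode(encoding)
    except UnicodeDecodeError:
        return None


def high_range_conflicts(iso_name, win_name):
    """List bytes in 0xA0-0xFF that the ISO codec defines but the Windows
    codec maps differently (or not at all)."""
    conflicts = []
    for b in range(0xA0, 0x100):
        iso_char = decode_or_none(b, iso_name)
        if iso_char is None:
            continue  # nothing for the Windows codec to be a superset of
        if decode_or_none(b, win_name) != iso_char:
            conflicts.append(b)
    return conflicts


# Pairings below are assumptions made for illustration only.
for iso_name, win_name in [("iso8859_1", "cp1252"), ("iso8859_2", "cp1250")]:
    n = len(high_range_conflicts(iso_name, win_name))
    print(f"{iso_name} vs {win_name}: {n} conflicting bytes in 0xA0-0xFF")
```

An empty list means the Windows codec agrees with the ISO one everywhere above
the C1 range; a non-empty list means that decoding the ISO label with the
Windows decoder would change some characters outside the C1 range.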

On 1 Jun 2007, at 8:57AM, Henri Sivonen wrote:
> 2) 0x85 in ISO-8859-10 and in ISO-8859-16 is decoded as in Windows-1252
> (ellipsis) by Gecko.

I am unable to reproduce this in Firefox (1.5 Mac, 2.0 Unix, 3.0 Mac). However,
C1 characters in ISO-8859-10 and ISO-8859-16 are /not/ converted to U+FFFD,
and this may give the reported result with an incorrectly encoded font
containing the ellipsis at Unicode code point U+0085.
(Cf. http://html5.ouvaton.org/iso-8859-16.png for an example of this with
accented small capitals as intruders.) Could this be the explanation?
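
The distinction between the two outcomes can be shown with a small sketch.
Python's codecs are used here only as a convenient stand-in and say nothing
about how Gecko is implemented; `describe` is a hypothetical helper.

```python
def describe(ch):
    """Format a one-character string as a U+XXXX code point label."""
    return f"U+{ord(ch):04X}"


raw = b"\x85"

# Decoded as ISO-8859-16, the byte passes through as the C1 control U+0085.
as_iso = raw.decode("iso8859_16")

# Decoded as Windows-1252, the same byte becomes HORIZONTAL ELLIPSIS.
as_win = raw.decode("cp1252")

print(describe(as_iso), describe(as_win))
```

In the first case the text still contains a control character, and any visible
glyph would have to come from the font rather than from the decoder.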

On 29 May 2007, at 4:10PM, Maciej Stachowiak wrote:
> for all unicode encodings and numeric entity references compatibility requires
> interpreting this range of code points in the WinLatin1 way.

On 1 Jun 2007, at 8:57AM, Henri Sivonen wrote:
> 1) ISO-8859-1 is decoded as Windows-1252.
> 3) ISO-8859-11 is decoded as Windows-874.
> I suggest adding the ISO-8859-11 to Windows-874 mapping to the spec.

1) The C1-range characters defined in Windows-874 seem to be a subset of those
defined in Windows-1252;
2) Safari and IE5.5/Mac treat C1 characters from all (supported) ISO-8859-*
encodings as Windows-1252;
3) IE7 does the same for a certain number of selected ISO-8859-* encodings.

As suggested earlier [1], a simpler solution seems to be to treat C1 bytes and
NCRs from /all/ ISO-8859-* and Unicode encodings as Windows-1252.
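
A minimal sketch of this proposal, assuming Python's `cp1252` codec as the
Windows-1252 table; all function names are hypothetical. Bytes that this codec
leaves undefined (such as 0x81) are kept as C1 controls here, which is an
assumption of the sketch and not something settled in this thread.

```python
import re

C1_RANGE = range(0x80, 0xA0)


def c1_as_windows_1252(code):
    """Return the character for a code point, reading C1 values as Windows-1252."""
    if code in C1_RANGE:
        try:
            return bytes([code]).decode("cp1252")
        except UnicodeDecodeError:
            return chr(code)  # assumption: keep the C1 control if undefined
    return chr(code)


def decode_iso8859(data, encoding):
    """Decode a single-byte ISO-8859-* byte string, overriding the C1 range."""
    out = []
    for b in data:
        if b in C1_RANGE:
            out.append(c1_as_windows_1252(b))
        else:
            out.append(bytes([b]).decode(encoding, errors="replace"))
    return "".join(out)


NCR = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def expand_ncrs(text):
    """Expand numeric character references, applying the same C1 override."""
    def repl(match):
        code = int(match.group(1), 16) if match.group(1) else int(match.group(2))
        if code > 0x10FFFF:
            return "\ufffd"  # out of Unicode range
        return c1_as_windows_1252(code)
    return NCR.sub(repl, text)


print(decode_iso8859(b"a\x85b", "iso8859_16"))
print(expand_ncrs("&#133; &#x80;"))
```

Because the same helper serves both bytes and NCRs, the rule does not depend on
which ISO-8859-* label the document carries.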

[1] http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2006-November/007804.html

-- 
Øistein E. Andersen
Received on Saturday, 2 June 2007 05:58:23 UTC