Re: Windows and Mac character encoding questions from Chris Wendt on 1998-08-28 (www-international@w3.org from July to September 1998)

From: Chris Wendt <christw@microsoft.com>
Date: Fri, 28 Aug 1998 10:05:57 -0700
To: "Deke Smith" <deke@tallent.com>, <www-international@w3.org>
Message-ID: <00a801bdd2a6$1fc12e40$ec86389d@christw2.dns.microsoft.com>
The difference between Windows-1252 and iso-8859-1 is that in iso-8859-1 the
code points 0x80 to 0x9F are reserved. In Windows-1252 most of the 0x80 to 0x9F
code points map to characters, among them the Euro currency sign at code
point 0x80.

All code points outside 0x80 to 0x9F are shared between iso-8859-1 and
Windows-1252.

Best practice is to label the document as iso-8859-1 unless it contains the
characters at code points 0x80 to 0x9F.

IANA registration of Windows-1252 has been requested from the charset
registrar. The newer versions of the two leading browsers and associated email
programs recognize the "Windows-1252" label. It was an oversight on my part not
to register Windows-1252 originally with the other Windows-125x registrations
:-(
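
The labelling rule above can be sketched in a few lines of Python. This is
only an illustration: the helper name pick_charset_label is hypothetical, and
it assumes the bytes are already known to be either iso-8859-1 or Windows-1252
text.

```python
def pick_charset_label(data: bytes) -> str:
    """Pick a charset label for text that is iso-8859-1 or Windows-1252.

    Hypothetical helper, not part of any library.
    """
    # Bytes 0x80..0x9F are the only range where the two encodings differ,
    # so their presence is what calls for the Windows-1252 label.
    if any(0x80 <= b <= 0x9F for b in data):
        return "windows-1252"
    return "iso-8859-1"


# "caf\u00e9" uses only 0xE9 outside ASCII, which both encodings share.
print(pick_charset_label("caf\u00e9".encode("cp1252")))
# The Euro sign encodes to 0x80, which only Windows-1252 defines.
print(pick_charset_label("5 \u20ac".encode("cp1252")))
```

The check looks only at byte values, so it cannot tell whether the input is
really text in one of these two encodings; that part stays an assumption.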

Here is a table of the code points that differ between iso-8859-1 and
Windows-1252 and the Unicode character they map to:

0x80 0x20AC #EURO SIGN
0x82 0x201A #SINGLE LOW-9 QUOTATION MARK
0x83 0x0192 #LATIN SMALL LETTER F WITH HOOK
0x84 0x201E #DOUBLE LOW-9 QUOTATION MARK
0x85 0x2026 #HORIZONTAL ELLIPSIS
0x86 0x2020 #DAGGER
0x87 0x2021 #DOUBLE DAGGER
0x88 0x02C6 #MODIFIER LETTER CIRCUMFLEX ACCENT
0x89 0x2030 #PER MILLE SIGN
0x8A 0x0160 #LATIN CAPITAL LETTER S WITH CARON
0x8B 0x2039 #SINGLE LEFT-POINTING ANGLE QUOTATION MARK
0x8C 0x0152 #LATIN CAPITAL LIGATURE OE
0x8E 0x017D #LATIN CAPITAL LETTER Z WITH CARON
0x91 0x2018 #LEFT SINGLE QUOTATION MARK
0x92 0x2019 #RIGHT SINGLE QUOTATION MARK
0x93 0x201C #LEFT DOUBLE QUOTATION MARK
0x94 0x201D #RIGHT DOUBLE QUOTATION MARK
0x95 0x2022 #BULLET
0x96 0x2013 #EN DASH
0x97 0x2014 #EM DASH
0x98 0x02DC #SMALL TILDE
0x99 0x2122 #TRADE MARK SIGN
0x9A 0x0161 #LATIN SMALL LETTER S WITH CARON
0x9B 0x203A #SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
0x9C 0x0153 #LATIN SMALL LIGATURE OE
0x9E 0x017E #LATIN SMALL LETTER Z WITH CARON
0x9F 0x0178 #LATIN CAPITAL LETTER Y WITH DIAERESIS

You can find the complete table at
ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
as well as definitions of the other code pages.
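
As a cross-check, the rows of the table above can be regenerated with a short
Python sketch. It assumes Python's built-in "cp1252" codec, which leaves the
five unassigned positions (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined and raises
an error for them. The function name cp1252_differences is made up for this
example.

```python
import unicodedata


def cp1252_differences():
    """Return (byte, Unicode code point, character name) for 0x80..0x9F."""
    rows = []
    for byte in range(0x80, 0xA0):
        try:
            ch = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            # Position not assigned in Python's cp1252 codec; skip it.
            continue
        rows.append((byte, ord(ch), unicodedata.name(ch)))
    return rows


for byte, code_point, name in cp1252_differences():
    print(f"0x{byte:02X} 0x{code_point:04X} #{name}")
```

Decoding the same bytes with Python's "latin-1" codec instead yields the
control code points U+0080 to U+009F rather than printable characters.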


>Does Windows 3.x/DOS use the same encoding as Windows 95/98?
Roughly: yes. Note that DOS and the Win9x MS-DOS prompt use the so-called "OEM"
code page. For European charsets these are 3-digit numbers like 437, 850,
863 and so on. This is not the same as the misnamed "ANSI" code page
which is exposed to Windows applications.
On Asian versions the OEM and "ANSI" code pages are the same.
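
To make the OEM versus "ANSI" distinction concrete, here is a small Python
sketch. It assumes Python's codec names "cp437" and "cp1252" as stand-ins for
the OEM and "ANSI" code pages of a US English system; the sample word is
arbitrary.

```python
text = "caf\u00e9"  # the word "cafe" with an acute accent on the e

# The same text stored under the two code pages gives different bytes.
oem_bytes = text.encode("cp437")    # as a DOS program would store it
ansi_bytes = text.encode("cp1252")  # as a Windows application would store it
print(oem_bytes)
print(ansi_bytes)

# Reading the OEM bytes as if they were "ANSI" turns the accented letter
# into a different character (a low quotation mark, per the table above).
misread = oem_bytes.decode("cp1252")
print(ascii(misread))
```

Text that contains only ASCII characters survives such a mix-up unchanged,
which is why the mismatch often goes unnoticed until accented letters appear.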

Yes, the 1252 code page of Windows 98 has some more characters than the
original Windows 95 and Windows 3.1 versions. Most importantly, the Windows 98
version has a place for the Euro currency sign and positions for upper and
lowercase Z with caron.

Both Windows 9x and Windows NT handle Unicode and multibyte code pages in
parallel and offer a number of conversion functions. However, most of the
system APIs on Win9x take only multibyte parameters, whereas NT offers both
versions for all system APIs. The article "Yes, Virginia, Windows 95 does
Unicode" on the Microsoft Developer Network CD gives a good overview of Win9x
Unicode capabilities.


-----Original Message-----
From: Deke Smith <deke@tallent.com>
To: www-international@w3.org <www-international@w3.org>
Date: Friday, August 28, 1998 8:25 AM
Subject: Windows and Mac character encoding questions


>I have seen some contradictory information about the character encoding
>for Windows text.
>
>One source said that Windows uses ISO-8859-1 for its English-language
>system, then I saw a thread about the Windows-1252 encoding and how it
>differs from ISO-8859-1.
>
>Does Windows 3.x/DOS use the same encoding as Windows 95/98? I have read
>that WinNT uses Unicode, but is the default encoding under the English
>language system different than the other flavors of Win/DOS? IANA lists
>"Windows-1250", "Windows-1254", etc. but does not list our friend,
>"Windows-1252".
>
>On the Mac, the English encoding is called "MacRoman" by the browsers,
>news clients and email clients. IANA does not list "MacRoman" as an
>encoding scheme, instead it lists, "Macintosh". Which is the acceptable
>usage?
>
>I'm using as my IANA reference
>ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets
>
>
>
>Just a little confused....
>
>-----------------------------------------------------------------
>Deke Smith
>Tallent Communications Group, Brentwood TN
>deke@tallent.com, 615-661-9878
>-----------------------------------------------------------------
>" The best way to predict the future is to invent it. "
>       - Alan Kay
>
>
Received on Friday, 28 August 1998 13:05:44 UTC