RE: "ANSI" encoding Was: how does unix/linux treats non ascii data internally from Addison Phillips [wM] on 2004-11-19 (www-international@w3.org from October to December 2004)

From: Addison Phillips [wM] <aphillips@webmethods.com>
Date: Fri, 19 Nov 2004 10:14:51 -0800
To: "Russ Rolfe" <rrolfe@windows.microsoft.com>, "Martin Duerst" <duerst@w3.org>, <www-international@w3.org>
Message-ID: <PNEHIBAMBMLHDMJDDFLHIEDLINAA.aphillips@webmethods.com>
Microsoft's terminology makes sense... for Microsoft. The ANSI and OEM code pages are a historical curiosity and their names are not strictly accurate in modern Windows. The "OEM" code page made sense back when the code page was defined in the hardware. It is slightly more accurate to call it the "DOS code page"; in practice it is the command shell code page.

"ANSI code page" was never an accurate name, as near as I can tell. The American National Standards Institute did not define code page 1252 (and certainly not the majority of other "ANSI" code pages). It's just that, back in the day, the "ANSI" code page that is now 1252 was based on ISO 8859-1 (it might even be more accurate to say that 8859-1 is based on an old ANSI standard).
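
The relationship between code page 1252 and ISO 8859-1 can be checked mechanically. The following is a minimal Python sketch, assuming that Python's built-in `cp1252` and `latin-1` codecs are reasonable stand-ins for the Windows code page and for ISO 8859-1 (Python's `latin-1` maps every byte to the code point of the same value, including the C1 control range). The function name `differing_bytes` is invented for this illustration.

```python
def differing_bytes(a="cp1252", b="latin-1"):
    """Return the byte values that the two codecs decode differently."""
    result = []
    for value in range(256):
        raw = bytes([value])
        decoded = []
        for codec in (a, b):
            try:
                decoded.append(raw.decode(codec))
            except UnicodeDecodeError:
                decoded.append(None)  # byte is undefined in this codec
        if decoded[0] != decoded[1]:
            result.append(value)
    return result


diff = differing_bytes()
# Report the range of bytes on which the two encodings disagree.
print(f"{len(diff)} bytes differ: 0x{diff[0]:02X} through 0x{diff[-1]:02X}")
# Byte 0x80 is the euro sign in code page 1252; ascii() keeps the output
# printable on any console.
print(ascii(b"\x80".decode("cp1252")))
```

With these codecs the two encodings agree everywhere except the range 0x80 to 0x9F, where ISO 8859-1 has control codes and code page 1252 has printable characters (or nothing at all).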

Microsoft has used these terms consistently for the past 15 or more years. When one talks to a Windows programmer, one must use the terms... but they are a source of never-ending confusion for those folks, since there is also Unicode, which is neither, and... oh botheration.

So... all that preamble is to say that I lump both together under the rubric of "legacy encodings" and then go on to say that the "command shell code page" (what MS calls OEM) and the "Windows code page" (what MS calls ANSI) are sometimes different from one another.

I'll go on to add that the fact that they are different is a never-ending source of woe for our installer writers. Batch files ("shell scripts") are dependent on the command shell code page in unsavory ways and this is especially annoying for Java programs.
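
A minimal Python sketch of why the mismatch bites, assuming a US-English Windows system where the Windows ("ANSI") code page is 1252 and the command shell ("OEM") code page is 437; other locales pair different code pages. The helper `reinterpret` and the sample string are made up for illustration. It models text written by a program using one code page and read back using the other.

```python
def reinterpret(text, written_as, read_as):
    """Encode text with one code page, then decode the same bytes with another."""
    return text.encode(written_as).decode(read_as)


sample = "caf\u00e9"  # "cafe" with e-acute, e.g. part of a directory name

# Written by a Windows program (code page 1252), read by the command shell
# (code page 437): byte 0xE9 becomes GREEK CAPITAL LETTER THETA.
shell_view = reinterpret(sample, "cp1252", "cp437")

# The opposite direction: e-acute is byte 0x82 in code page 437, which code
# page 1252 decodes as SINGLE LOW-9 QUOTATION MARK.
windows_view = reinterpret(sample, "cp437", "cp1252")

# ascii() keeps the output printable regardless of the console's own code page.
print(ascii(shell_view))
print(ascii(windows_view))
```

The ASCII letters survive in both directions; only the byte above 0x7F changes meaning, which is why such problems tend to go unnoticed until a path or user name contains a non-ASCII character.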

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
http://www.webMethods.com

Chair, W3C Internationalization Working Group
http://www.w3.org/International

Internationalization is an architecture. 
It is not a feature.

> -----Original Message-----
> From: www-international-request@w3.org 
> [mailto:www-international-request@w3.org]On Behalf Of Russ Rolfe
> Sent: November 19, 2004 8:56
> To: Martin Duerst; dc; www-international@w3.org
> Subject: "ANSI" encoding Was: how does unix/linux treats non 
> ascii data internally
> 
> 
> 
> At 10:11 pm 04/11/18, Martin wrote:
>  >
>  > All MS Windows systems still expose file names in the local (often
>  > not really correctly called "ANSI") encoding ...
> 
> Just curious: what do the rest of you use as a generic term for the
> "ANSI" encodings?
> 
> Russ
> 
> -----Original Message-----
> From: www-international-request@w3.org
> [mailto:www-international-request@w3.org] On Behalf Of Martin Duerst
> Sent: Thursday, November 18, 2004 10:11 PM
> To: dc; www-international@w3.org
> Subject: Re: how does unix/linux treats non ascii data internally
> 
> 
> At 14:10 04/11/19, dc wrote:
>  >
>  >hi all,
>  >
>  >In Windows, non-ASCII data is treated as Unicode (wide char), UCS-2/
>  >UTF-16, e.g. for filenames.
> 
> That's true for Windows NT/2000/XP, internally and for the
> 'wide-character' APIs. As far as I understand, it's not true internally
> on Windows 98, although I guess that system exposes filenames as
> UCS-2/UTF-16 for those 'wide-character' APIs that are available on it.
> 
> All MS Windows systems still expose file names in the local (often not
> really correctly called "ANSI") encoding for the old
> (bytestring-oriented) APIs. This shows up when running software written
> for both unix and windows using these APIs.
> A typical example would be a cvs client on Windows.
> 
>  >how do Unix systems treat ???????
> 
> Unix treats them as bytes. It has no idea about what the encoding is.
> Each user/process can choose an encoding by setting the encoding
> component of a locale. In the old days, that was fine; everybody in
> Japan on Unix machines was using EUC-JP, and nobody else was seeing
> these file names. In a networked world, that's no longer the case at
> all, so this model doesn't really work anymore, but it's still in wide
> use. The tendency (although slow) today is to move towards using UTF-8
> for encoding file names. This works quite well in many cases. But it
> needs conscious decisions, setup, and a bit of user education.
> 
> On Macs, I'm not sure what's used internally, but on the Unix side of
> Mac OS X, filenames are exposed as UTF-8.
> The problem on the Mac is with normalization; in most cases (as far as I
> understand, all except Korean), filenames are decomposed. Depending on the
> interface used, that may or may not show up, similar to how differences
> in case handling between Windows and Unix systems may or may not show
> up.
> 
> 
> Regards,     Martin. 
> 
>
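
The two file name points quoted above (Unix sees only bytes, and Mac OS X exposes decomposed UTF-8) can be sketched in Python. This is an illustration under stated assumptions, not a description of any particular file system: the sample names and the helper `decode_or_none` are invented, and standard NFD is used as an approximation of the decomposition applied on the Mac.

```python
import unicodedata


def decode_or_none(raw, encoding):
    """Decode a byte string, or return None if the bytes are invalid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# A Japanese file name, as stored by a user whose locale uses EUC-JP.
name = "\u65e5\u672c\u8a9e.txt"
stored = name.encode("euc_jp")

# The system keeps only the bytes. A process running under a UTF-8 locale
# cannot recover the name; one running under EUC-JP can.
seen_as_utf8 = decode_or_none(stored, "utf-8")
seen_as_eucjp = decode_or_none(stored, "euc_jp")
print(seen_as_utf8 is None, seen_as_eucjp == name)

# Normalization: the same visible name has two different UTF-8 byte strings.
composed = unicodedata.normalize("NFC", "caf\u00e9")    # e-acute as one code point
decomposed = unicodedata.normalize("NFD", "caf\u00e9")  # "e" + combining acute accent
print(composed.encode("utf-8"))
print(decomposed.encode("utf-8"))
print(composed == decomposed)
```

A byte-for-byte comparison treats the composed and decomposed forms as different names, so software that moves files between systems generally has to normalize to one form before comparing.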
Received on Friday, 19 November 2004 18:16:06 UTC