"ANSI" encoding Was: how does unix/linux treats non ascii data internally

From: Russ Rolfe <rrolfe@windows.microsoft.com>
Date: Fri, 19 Nov 2004 08:55:31 -0800
Message-ID: <F7B97826912BC4419D5DDF53B21694530A045C00@WIN-MSG-10.wingroup.windeploy.ntdev.microsoft.com>
To: "Martin Duerst" <duerst@w3.org>, "dc" <deepak.rathore@gmail.com>, <www-international@w3.org>

At 10:11 pm 04/11/18, Martin wrote:
 >
 > All MS Windows systems still expose file names in the local (often
 > not really correctly called "ANSI") encoding ...

Just curious: what do the rest of you use as a generic term for the
"ANSI" encodings?

Russ

-----Original Message-----
From: www-international-request@w3.org
[mailto:www-international-request@w3.org] On Behalf Of Martin Duerst
Sent: Thursday, November 18, 2004 10:11 PM
To: dc; www-international@w3.org
Subject: Re: how does unix/linux treats non ascii data internally


At 14:10 04/11/19, dc wrote:
 >
 >hi all,
 >
 >In Windows, non-ASCII data is treated as Unicode (wide char), UCS-2/
 >UTF-16, e.g. for filenames.

That's true for Windows NT/2000/XP, internally and for the
'wide-character' APIs. As far as I understand, it's not true internally
on Windows 98, although I guess Windows 98 exposes filenames as
UCS-2/UTF-16 for those 'wide-character' APIs available on that system.

All MS Windows systems still expose file names in the local (often not
really correctly called "ANSI") encoding for the old
(bytestring-oriented) APIs. This shows up when running software written
for both unix and windows using these APIs.
A typical example would be a cvs client on Windows.
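
The difference between the two API families can be sketched in Python. This
only simulates the effect with codecs and does not call any Windows API; the
sample file name and the choice of code pages 932 and 1252 as the "local"
encodings are assumptions made for illustration.

```python
# Simulate what the two Windows API families can represent for one name.
# The file name and the code pages below are illustrative assumptions.
name = "日本語.txt"

# Wide-character view: UTF-16 code units, independent of the local code page.
wide = name.encode("utf-16-le")

# Bytestring ("ANSI") view on a Japanese system (code page 932): lossless.
ansi_ja = name.encode("cp932")

# Bytestring view on a Western European system (code page 1252):
# the kanji cannot be represented, so each one is replaced by "?".
ansi_west = name.encode("cp1252", errors="replace")

print(len(wide))                         # 14 bytes: 7 characters, 2 bytes each
print(ansi_ja.decode("cp932") == name)   # True: the name round-trips
print(ansi_west)                         # b'???.txt'
```

A program that only ever sees the bytestring form on the second system has no
way to recover the original name, which is the kind of problem a
cross-platform client using the old APIs can run into.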

 >how does unix systems treat ???????

Unix treats them as bytes. It has no idea about what the encoding is.
Each user/process can choose an encoding by setting (the encoding
component of) a locale. In the old days, that was fine; everybody in
Japan on Unix machines was using EUC-JP, and nobody else was seeing
these file names. In a networked world, that's no longer the case at
all, so this model doesn't really work anymore, but it's still in wide
use. The tendency (although slow) today is to move towards using UTF-8
for encoding file names. This works quite well in many cases. But it
needs conscious decisions, setup, and a bit of user education.
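
A small Python sketch of why the bytes-only model breaks down once file names
are shared: the same bytes read differently depending on the encoding the
reader assumes. The helper `decode_name` and the sample name are hypothetical,
introduced only for this illustration.

```python
from typing import Optional


def decode_name(raw: bytes, encoding: str) -> Optional[str]:
    """Decode raw file-name bytes, or return None if they are invalid."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# Bytes as a process running in an EUC-JP locale would have written them.
raw = "日本語".encode("euc_jp")

print(decode_name(raw, "euc_jp") == "日本語")  # True: the intended reading
print(decode_name(raw, "utf-8"))               # None: not valid UTF-8
print(len(decode_name(raw, "latin-1")))        # 6: one wrong character per byte
```

Nothing in the stored bytes says which reading is right; that information
lives only in the locale of whoever created the file.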

On Macs, I'm not sure what's used internally, but on the Unix side of
Mac OS X, filenames are exposed as UTF-8.
The problem on the Mac is with normalization; in most cases (as far as I
understand, except for Korean), filenames are decomposed. Depending on the
interface used, that may or may not show up, similar to how differences
in case handling between Windows and Unix systems may or may not show
up.
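
The normalization issue can be sketched with Python's `unicodedata` module.
Standard NFD is used here as a stand-in, on the assumption that it is close
enough to the decomposition the Mac file system applies; the exact rules used
there may differ.

```python
import unicodedata

# One visible character, two possible code point sequences.
composed = unicodedata.normalize("NFC", "é")    # U+00E9
decomposed = unicodedata.normalize("NFD", "é")  # U+0065 U+0301

print(len(composed), len(decomposed))   # 1 2
print(composed == decomposed)           # False: a naive comparison fails
print(composed.encode("utf-8"))         # b'\xc3\xa9'
print(decomposed.encode("utf-8"))       # b'e\xcc\x81'

# Normalizing both sides to the same form before comparing fixes this.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Software that compares file names byte for byte across systems would treat
these two spellings as different names unless it normalizes first.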


Regards,     Martin. 
Received on Friday, 19 November 2004 16:56:05 GMT
