RE: "ANSI" encoding Was: how does unix/linux treats non ascii data internally

"Regional character set" or
"Regional encoding" 

-Paul

-----Original Message-----
From: www-international-request@w3.org [mailto:www-international-request@w3.org] On Behalf Of Russ Rolfe
Sent: Friday, November 19, 2004 8:56 AM
To: Martin Duerst; dc; www-international@w3.org
Subject: "ANSI" encoding Was: how does unix/linux treats non ascii data internally


At 10:11 pm 04/11/18, Martin wrote:
 >
 > All MS Windows systems still expose file names in the local (often not really
 > correctly called "ANSI") encoding ...

Just curious: what do the rest of you use as a generic term for the "ANSI"
encodings?

Russ

-----Original Message-----
From: www-international-request@w3.org
[mailto:www-international-request@w3.org] On Behalf Of Martin Duerst
Sent: Thursday, November 18, 2004 10:11 PM
To: dc; www-international@w3.org
Subject: Re: how does unix/linux treats non ascii data internally


At 14:10 04/11/19, dc wrote:
 >
 >hi all,
 >
 >In windows, non ascii data is treated as unicode( wide char) ucs-2/
 >utf-16. for eg filename

That's true for Windows NT/2000/XP, internally and for the 'wide-character' APIs. As far as I understand, it's not true internally for Windows 95/98/ME, although I guess Windows 98 exposes filenames as UCS-2/UTF-16 for those 'wide-character' APIs available on that system.

All MS Windows systems still expose file names in the local (often not really correctly called "ANSI") encoding for the old
(bytestring-oriented) APIs. This shows up when running software written for both unix and windows using these APIs.
A typical example would be a cvs client on Windows.
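The difference is easy to see by encoding the same name both ways. A minimal Python sketch (cp1252 and cp850 are just illustrative code pages, not tied to any particular Windows install):

```python
# The same filename as raw bytes under a legacy ("ANSI") code page
# versus the UTF-16 used internally by NT-based Windows.
name = "café.txt"

ansi_bytes = name.encode("cp1252")     # what bytestring APIs would see
wide_bytes = name.encode("utf-16-le")  # what wide-character APIs would see

print(ansi_bytes)   # b'caf\xe9.txt' - one byte for the accented letter
print(wide_bytes)   # two bytes per UTF-16 code unit

# A bytestring-oriented client that assumes a different code page
# than the one the name was written under garbles it:
print(ansi_bytes.decode("cp850"))  # no longer 'café.txt'
```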

 >how does unix systems treat ???????

Unix treats them as bytes. It has no idea about what the encoding is.
Each user/process can choose an encoding by setting (the encoding
component) of a locale. In the old days, that was fine; everybody in Japan on Unix machines was using EUC-JP, and nobody else was seeing these file names. In a networked world, that's no longer the case at all, so this model doesn't really work anymore, but it's still in wide use. The tendency (although slow) today is to move towards using UTF-8 for encoding file names. This works quite well in many cases. But it needs conscious decisions, setup, and a bit of user education.

On Macs, I'm not sure what's used internally, but on the Unix side of Mac OS X, filenames are exposed as UTF-8.
The problem on the Mac is with normalization; in most cases (as far as I understand, except for Korean), filenames are decomposed. Depending on the interface used, that may or may not show up, similar to how differences in case handling between Windows and Unix systems may or may not show up.
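The normalization difference can be reproduced with Python's standard unicodedata module (a sketch; the filesystem's actual rules are close to, but not exactly, NFD):

```python
import unicodedata

# Most systems pass filenames around precomposed (NFC); Mac OS X's
# filesystem stores them decomposed (roughly NFD).
nfc = unicodedata.normalize("NFC", "café")  # 'é' as one code point
nfd = unicodedata.normalize("NFD", "café")  # 'e' plus combining acute

print(len(nfc), len(nfd))  # 4 5
print(nfc == nfd)          # False: byte-for-byte different names
print(unicodedata.normalize("NFC", nfd) == nfc)  # True once renormalized
```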


Regards,     Martin. 

Received on Friday, 19 November 2004 17:00:35 UTC