
Re: how does unix/linux treats non ascii data internally

From: Martin Duerst <duerst@w3.org>
Date: Fri, 19 Nov 2004 15:11:03 +0900
To: dc <deepak.rathore@gmail.com>, www-international@w3.org

At 14:10 04/11/19, dc wrote:
 >hi all,
 >In windows, non ascii data is treated as unicode( wide char) ucs-2/
 >utf-16. for eg filename

That's true for Windows NT/2000/XP, internally and for the
'wide-character' APIs. As far as I understand, it's not true
internally on Windows 98, although I guess that system exposes
filenames as UCS-2/UTF-16 for those 'wide-character' APIs
that are available there.

All MS Windows systems still expose file names in the local
encoding (often, not quite correctly, called "ANSI") for the
old (bytestring-oriented) APIs. This shows up when running
software that was written for both Unix and Windows and that
uses these APIs. A typical example would be a cvs client on
Windows.
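
To make the difference between the two API families concrete,
here is a minimal Python sketch. It only encodes one name in
several ways; it does not call any Windows API. The code pages
cp932 and cp1252 and the helper encode_for_api are assumptions
chosen for illustration.

```python
# Illustration only: cp932 (Japanese) and cp1252 (Western European) are
# assumed examples of a "local" encoding; encode_for_api is a hypothetical
# helper, not a real Windows call.

def encode_for_api(name: str, encoding: str):
    """Return `name` encoded in `encoding`, or None if it cannot be represented."""
    try:
        return name.encode(encoding)
    except UnicodeEncodeError:
        return None

name = "日本語.txt"

# What a 'wide-character' API would carry: two bytes per character here.
wide = encode_for_api(name, "utf-16-le")

# What a bytestring-oriented API would carry under two different locales.
japanese_local = encode_for_api(name, "cp932")
western_local = encode_for_api(name, "cp1252")

print(len(wide))                   # 14 bytes for 7 characters
print(japanese_local is not None)  # True: representable in cp932
print(western_local)               # None: not representable in cp1252
```

The same name is always representable in UTF-16, but whether
the bytestring form exists at all depends on the local encoding.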

 >how does unix systems treat ???????

Unix treats them as bytes. It has no idea what the
encoding is. Each user/process can choose an encoding by
setting the encoding component of a locale. In the old
days, that was fine; everybody in Japan on Unix machines
was using EUC-JP, and nobody else was seeing these file
names. In a networked world, that's no longer the case at
all, so this model doesn't really work anymore, but it's
still in wide use. The tendency (although slow) today is
to move towards using UTF-8 for encoding file names. This
works quite well in many cases. But it needs conscious
decisions, setup, and a bit of user education.
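
The "just bytes" model can be sketched in a few lines of Python.
This is only an illustration: the file name and the helper
interpret are made up, and the three encodings stand in for the
locales of three different users looking at the same bytes.

```python
# Illustration only: the name and the helper below are hypothetical.

def interpret(raw: bytes, encoding: str):
    """Decode raw file-name bytes the way a process in that locale would.

    Returns None when the bytes are not valid in the given encoding.
    """
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None

name = "日本語.txt"

# The bytes a user with an EUC-JP locale would store on disk.
raw = name.encode("euc_jp")

as_euc_jp = interpret(raw, "euc_jp")    # the creator sees the intended name
as_utf8 = interpret(raw, "utf-8")       # invalid UTF-8, so None
as_latin1 = interpret(raw, "latin-1")   # always decodes, but to garbage

print(as_euc_jp == name)
print(as_utf8)
print(as_latin1 == name)
```

The file system stores only raw; which of the three readings a
user gets depends entirely on that user's locale.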

On Macs, I'm not sure what's used internally, but on the
Unix side of Mac OS X, filenames are exposed as UTF-8.
The problem on the Mac is with normalization; in most
cases (as far as I understand, all except Korean), filenames
are decomposed. Depending on the interface used, that
may or may not show up, similar to how differences in
case handling between Windows and Unix systems may or
may not show up.
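
A short Python sketch of the normalization issue follows. It
uses the standard Unicode NFD form as a stand-in for whatever
decomposition the Mac file system applies, which is an
assumption; the file name and the helper same_name are made up
for illustration.

```python
import unicodedata

# Illustration only: standard NFD is assumed here as an approximation of
# the decomposition applied to file names on the Mac.

composed = "caf\u00e9.txt"                           # e-acute as one code point
decomposed = unicodedata.normalize("NFD", composed)  # 'e' + combining acute

def same_name(a: str, b: str) -> bool:
    """Compare two file names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# The two forms look identical on screen but differ as strings and as bytes.
print(composed == decomposed)                                  # False
print(composed.encode("utf-8") == decomposed.encode("utf-8"))  # False
print(len(composed), len(decomposed))                          # 8 9
print(same_name(composed, decomposed))                         # True
```

Software that compares names byte by byte will treat the two
forms as different files unless it normalizes both sides first.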

Regards,     Martin. 
Received on Friday, 19 November 2004 13:39:52 UTC
