Re: how does unix/linux treats non ascii data internally

From: dc <deepak.rathore@gmail.com>
Date: Mon, 22 Nov 2004 10:48:28 +0530
Message-ID: <8567ac78041121211874f4ebf4@mail.gmail.com>
To: Martin Duerst <duerst@w3.org>
Cc: www-international@w3.org

Thanks, Martin.
Yes, only Windows versions built on NT technology support Unicode
internally, not Windows 95 or 98.

With respect to Unix: according to you, and from what I have experienced,
non-ASCII data is treated byte by byte.

But on the HP site I found the following, which really confused me about Unix:
"TRU64 UNIX: Characters are processed internally using a 32-bit wchar_t data type"
http://h30097.www3.hp.com/unix/i18n.htm#single

Any ideas on this? Or is Tru64 the only Unix flavour that treats data as
wide characters?
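
To make the two views concrete, here is a small sketch of the difference between a multibyte external form and a fixed-width 32-bit internal form. It is an illustration only: Python's str stands in for a wchar_t array, the sample text and the encodings are my own choices, and nothing here is taken from the Tru64 documentation.

```python
# Illustration only: Python's str stands in for a C wchar_t array here.
text = "日本語"  # three characters, chosen arbitrarily for the demo

# External form: multibyte byte strings, as stored in files or file names.
as_euc_jp = text.encode("euc_jp")   # 2 bytes per character for these kanji
as_utf8 = text.encode("utf-8")      # 3 bytes per character for these kanji

# Internal form: one fixed-width 32-bit unit per character.
as_wide = text.encode("utf-32-le")  # 4 bytes per character, no byte-order mark

print(len(as_euc_jp), len(as_utf8), len(as_wide))  # 6 9 12

# Decoding the external bytes (roughly what mbstowcs() does in C)
# recovers one code point per character.
code_points = [ord(ch) for ch in as_euc_jp.decode("euc_jp")]
print([hex(cp) for cp in code_points])  # ['0x65e5', '0x672c', '0x8a9e']
```

If this reading is right, the two statements do not conflict: a program may convert to wide characters for its own processing, while the bytes it reads and writes stay multibyte.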

Thanks
DC


On Fri, 19 Nov 2004 15:11:03 +0900, Martin Duerst <duerst@w3.org> wrote:
> At 14:10 04/11/19, dc wrote:
>  >
>  >hi all,
>  >
>  >In Windows, non-ASCII data is treated as Unicode (wide char), UCS-2/
>  >UTF-16, for example filenames.
> 
> That's true for Windows NT/2000/XP, internally and for the
> 'wide-character' APIs. As far as I understand, it's not true
> internally for Windows 95/98, although I guess Windows 98 exposes
> filenames as UCS-2/UTF-16 for those 'wide-character' APIs
> available on that system.
> 
> All MS Windows systems still expose file names in the local
> (often, not quite correctly, called "ANSI") encoding for the
> old (bytestring-oriented) APIs. This shows up when running
> software written for both Unix and Windows using these APIs.
> A typical example would be a CVS client on Windows.
> 
>  >How do Unix systems treat it?
> 
> Unix treats them as bytes. It has no idea about what the
> encoding is. Each user/process can choose an encoding by
> setting (the encoding component of) a locale. In the old
> days, that was fine; everybody in Japan on Unix machines
> was using EUC-JP, and nobody else was seeing these file
> names. In a networked world, that's no longer the case at
> all, so this model doesn't really work anymore, but it's
> still in wide use. The tendency (although slow) today is
> to move towards using UTF-8 for encoding file names. This
> works quite well in many cases. But it needs conscious
> decisions, setup, and a bit of user education.
> 
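
The "bytes with no known encoding" model described above can be sketched as follows. This is a minimal illustration and not part of the original message: it assumes a Linux-style filesystem that accepts arbitrary bytes in names (some systems reject names that are not valid UTF-8), and the helper names list_raw_names and try_decode are made up for the example.

```python
import os
import tempfile


def list_raw_names(directory):
    """Return directory entries as the raw bytes stored by the filesystem."""
    # Passing a bytes path makes os.listdir() return bytes, undecoded.
    return sorted(os.listdir(os.fsencode(directory)))


def try_decode(raw, encoding):
    """Decode a raw name with a guessed encoding; None if the guess fails."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


with tempfile.TemporaryDirectory() as tmp:
    # A user with an EUC-JP locale creates a file; only bytes are stored.
    raw_name = "日本".encode("euc_jp")
    open(os.path.join(os.fsencode(tmp), raw_name), "wb").close()

    stored = list_raw_names(tmp)

# The name round-trips as bytes, but its meaning depends on the reader's guess.
as_euc_jp = try_decode(stored[0], "euc_jp")  # the intended text
as_utf8 = try_decode(stored[0], "utf-8")     # None: not valid UTF-8
print(stored[0], as_euc_jp, as_utf8)
```

Nothing in the directory entry records which encoding was meant, so a reader with a different locale either fails to decode the name or sees different characters.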
> On Macs, I'm not sure what's used internally, but on the
> Unix side of Mac OS X, filenames are exposed as UTF-8.
> The problem on the Mac is with normalization; in most
> cases (as far as I understand, all except Korean), filenames
> are decomposed. Depending on the interface used, that
> may or may not show up, similar to how differences in
> case handling between Windows and Unix systems may or
> may not show up.
> 
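
The normalization issue mentioned above can be illustrated with the composed and decomposed forms of one name. This is a sketch using the standard NFC and NFD forms from Python's unicodedata module; the sample name and the helper same_name are assumptions made for the example, and it does not claim to reproduce the exact decomposition rules used on the Mac.

```python
import unicodedata

# The same visible name in composed (NFC) and decomposed (NFD) form.
composed = unicodedata.normalize("NFC", "café")    # 'é' is one code point
decomposed = unicodedata.normalize("NFD", "café")  # 'e' plus a combining accent

print(composed == decomposed)           # False: the code points differ
print(len(composed), len(decomposed))   # 4 5
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))  # 5 6


def same_name(a, b):
    """Compare two names after bringing both to the same normalization form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(same_name(composed, decomposed))  # True
```

A byte-oriented tool that compares the UTF-8 forms directly would treat these as two different names, which is how the difference may show up depending on the interface used.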
> Regards,     Martin.
> 
>
Received on Monday, 22 November 2004 05:18:29 GMT
