RE: accented characters, etc. from Karlsson Kent - keka on 1999-12-03 (www-html@w3.org from December 1999)

From: Karlsson Kent - keka <keka@im.se>
Date: Fri, 3 Dec 1999 23:11:46 +0100
To: "'Murray Altheim'" <altheim@eng.sun.com>, John Delacour <JD@EREMITA.demon.co.uk>
Cc: www-html@w3.org
Message-ID: <C110A2268F8DD111AA1A00805F85E58DA68486@ntgbg1>

> -----Original Message-----
> From: Murray Altheim [mailto:altheim@eng.sun.com]
> Sent: Friday, December 03, 1999 12:23 PM
> To: John Delacour
> Cc: www-html@w3.org
> Subject: Re: accented characters, etc.
> 
> 
> John Delacour wrote:
> > 
> > After all Unicode itself is an ISO standard.
> 
> No, it's a product of the Unicode Consortium. There are attempts at 
> keeping ISO 10646 in line (so it is similar to Unicode but generally
> not identical), but the Unicode standard is not an ISO standard.

To nitpick:  Unicode 3.0 and ISO/IEC 10646-1:2000 have EXACTLY
the same characters at the same code points.  (Unicode 2.1 is a bit
harder to pinpoint relative to 10646-1:1993: it corresponds to Amd.1-7
plus two more characters from a later amendment.)

In addition to characters at code points, Unicode also defines character
properties and the BiDi algorithm.  These are not part of 10646 yet.
Furthermore Unicode defines canonical and compatibility mappings,
as character properties, and a normalisation algorithm.
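
A minimal sketch of what these properties and mappings look like in
practice, assuming Python's standard unicodedata module (which tracks a
much later Unicode version than 3.0).  The sample characters are chosen
only for illustration:

```python
import unicodedata

# "é" as one precomposed code point (U+00E9) ...
precomposed = "\u00e9"
# ... and as "e" followed by COMBINING ACUTE ACCENT (U+0065 U+0301).
decomposed = "e\u0301"

# Code point by code point the two strings differ.
print(precomposed == decomposed)  # False

# Canonical mappings: NFC composes, NFD decomposes.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed)  # True
print(nfd == decomposed)   # True

# Compatibility mappings apply only in the "K" forms.
# LATIN SMALL LIGATURE FI (U+FB01) survives NFC but folds to "fi" under NFKC.
ligature = "\ufb01"
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)
print(nfc_ligature == ligature)  # True
print(nfkc_ligature)             # fi

# Character properties: general category and BiDi class.
combining_category = unicodedata.category("\u0301")  # combining acute accent
hebrew_bidi = unicodedata.bidirectional("\u05d0")    # HEBREW LETTER ALEF
print(combining_category)  # Mn
print(hebrew_bidi)         # R
```

None of this data comes from 10646 itself; it is all taken from the
Unicode character database.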

So, just looking at characters at code points, Unicode 3.0 and 10646
in its year 2000 incarnation are identical.  Beyond code point allocations
there are differences, mainly that Unicode normatively specifies things
that 10646 (yet) does not speak of.  There are some other points as
well, which I will not bore you with.

I would predict that Unicode and 10646, at main synchronisation points,
will remain identical regarding character allocations.  But it is correct
that 10646 and Unicode are not the same otherwise, and might not
become the same for quite a while, if ever.

	/Kent Karlsson

PS
Regarding HTML character entities: Please instead use the proper
characters directly whenever possible.  NCRs and named characters
should be used ONLY when you cannot express the proper character
directly in the encoding used for the document.  So if the document is
in UTF-8 you never really need any NCRs or named character entities.
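
To illustrate, here is a small sketch using Python's standard html
module; the sample markup is hypothetical.  It shows that NCRs and named
entities decode to the same text as the direct characters, and that NCRs
are only forced on you when the document encoding (here ASCII) cannot
represent a character:

```python
import html

# The same paragraph written three ways (sample text is made up).
direct = "<p>Åsa Öberg, café</p>"
with_ncrs = "<p>&#197;sa &#214;berg, caf&#233;</p>"
with_named = "<p>&Aring;sa &Ouml;berg, caf&eacute;</p>"

# All three decode to the same characters.
print(html.unescape(with_ncrs) == direct)   # True
print(html.unescape(with_named) == direct)  # True

# UTF-8 can carry every character directly, so no references are needed.
utf8_bytes = direct.encode("utf-8")
print(utf8_bytes.decode("utf-8") == direct)  # True

# ASCII cannot, so the unencodable characters fall back to decimal NCRs.
ascii_bytes = direct.encode("ascii", errors="xmlcharrefreplace")
print(ascii_bytes)  # b'<p>&#197;sa &#214;berg, caf&#233;</p>'
```

The xmlcharrefreplace error handler touches only the characters the
target encoding lacks, which matches the advice above: use a reference
only where the direct character is impossible.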

Received on Friday, 3 December 1999 17:14:05 UTC