RE: Equality from Luc Rooijakkers on 1993-08-06 (ietf-charsets@w3.org from July to September 1993)

From: Luc Rooijakkers <luc@opus.spc.nl>
Date: Fri, 06 Aug 1993 18:53:56 +0100
To: Masataka Ohta <mohta@necom830.cc.titech.ac.jp>
Cc: Luc Rooijakkers <lwj@cs.kun.nl>, ietf-charsets@INNOSOFT.COM
Message-id: <9308061753.AA00283@opus.spc.nl>

> > > It is very useful, if the uniqueness is not achievable, to have
> > > some short notation of regular expressions to represent all the
> > > equivalent characters.
> > 
> > This merely moves the burden up to the user, to type that regular
> > expression,
> 
> That's why a *SHORT* notation is very useful.

I still don't see the advantages of this over canonicalization at lower
layers, for the class of characters that you want to be canonicalized.
Also, since the sender presumably knows the character sets it is
using, it has a much better chance of having the appropriate information
than the receiver.

> > I found it very illuminating that the June 1993 version of ECMA-35, to
> > be proposed to ISO as a new edition of 2022, requires the lowest
> > numbered of G0/G1/G2/G3 to be used when a character is present in
> > multiple sets, *even if a higher numbered set is already invoked
> > and the lowest numbered set is not* (clause 7.5). This amounts to a
> > version of uniqueness.
> 
> I don't think it any useful.

I wasn't implying anything about its usefulness, but merely pointing out
that ECMA people have apparently felt the need for such a mechanism,
which means that we don't need to feel bad about it, should we feel the need.
Of course, this is second-guessing their intentions; it would be useful
if someone could dig up the reasons for this change.
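
A minimal sketch of the "lowest-numbered set" rule as it is described
above, in Python. The set contents and the names DESIGNATED, ORDER and
choose_set are hypothetical, chosen for illustration only; nothing here
is taken from the text of ECMA-35 itself.

```python
# Hypothetical repertoires for the four designated sets. The contents are
# illustrative only and do not correspond to any registered character set.
DESIGNATED = {
    "G0": {"A", "B", "C"},
    "G1": {"A", "E_ACUTE"},
    "G2": {"E_ACUTE", "SHARP_S"},
    "G3": {"SHARP_S", "A"},
}

# Sets are searched in ascending order, so the lowest-numbered one wins.
ORDER = ("G0", "G1", "G2", "G3")


def choose_set(char, designated=DESIGNATED):
    """Return the lowest-numbered set that contains char.

    Which set happens to be invoked at the moment is deliberately not
    an input: per the rule as described, it does not affect the choice.
    """
    for name in ORDER:
        if char in designated[name]:
            return name
    raise ValueError("character not present in any designated set: %r" % (char,))
```

Under these assumptions each character has exactly one permitted
encoding for a given set of designations, which is the sense in which
the rule amounts to a version of uniqueness.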

> As the code points of some character varies with no regularity in
> different character sets, it means you must have a table. And, if
> you have such a table, it is not at all difficult for the receiver
> side to use the table to disambiguate a character with multiple
> representations.

If the *sender* has a table, because it uses multiple character sets,
this does not mean that the *receiver* has that table, too.
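
A minimal sketch of sender-side canonicalization, in Python. The table
CANONICAL and the function canonicalize are hypothetical names. The
single table entry treats the letter A at 0x2341 in JIS X 0208 as
equivalent to ASCII 0x41; that equivalence is an assumption made for
illustration, and it is exactly the point in dispute in this thread.

```python
# Sender-side table: (character set, code point) -> canonical form.
# The one entry below is an assumed equivalence, not an established fact.
CANONICAL = {
    ("JIS X 0208", 0x2341): ("ASCII", 0x41),
}


def canonicalize(chars, table=CANONICAL):
    """Replace each (charset, code) pair by its canonical form, if any.

    Pairs without a table entry pass through unchanged.
    """
    return [table.get(pair, pair) for pair in chars]
```

Because the sender applies the table before transmission, the receiver
only ever sees the canonical form and needs no table of its own.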

> And, do you think 'A' in JIS X0208 is identical to 'A' in ASCII?
> Do you think Han characters of GB, CNS, JIS, KCS unified in ISO 10646
> the same characters?

As you have very vocally pointed out on numerous mailing lists, there
are apparently people that don't think the Han unification was a good
idea. Since I know next to nothing about eastern languages, I have no
opinion on the matter.

ASCII, however, is a different case. I was under the impression
that the ASCII "subsets" of GB, JIS, KCS, etc. were in fact identical in
meaning to the corresponding ASCII characters. Could you clarify why you
believe this not to be the case?

--
Luc Rooijakkers                                 Internet: lwj@cs.kun.nl
SPC Company, the Netherlands                    UUCP: uunet!cs.kun.nl!lwj

Received on Friday, 6 August 1993 10:43:57 UTC