RE: Equality from Masataka Ohta on 1993-08-15 (ietf-charsets@w3.org from July to September 1993)

From: Masataka Ohta <mohta@necom830.cc.titech.ac.jp>
Date: Sun, 15 Aug 1993 10:48:55 +0900 (JST)
To: luc@opus.spc.nl (Luc Rooijakkers)
Cc: lwj@cs.kun.nl, ietf-charsets@INNOSOFT.COM
Message-id: <9308150149.AA21619@necom830.cc.titech.ac.jp>

> > > > It is very useful, if the uniqueness is not achievable, to have
> > > > some short notation of regular expressions to represent all the
> > > > equivalent characters.

> I still don't see the advantages of this over canonicalization at lower
> layers, for the class of characters that you want to be canonicalized.

I don't think "regular expressions" belong to the lower layers. They would
be used at the upper layers, such as the user interface of the grep command.

My opinion is that canonicalization at the lower layers is not necessary,
because equivalence can be a problem only at the upper layers. And, at the
upper layers, it can be handled with some short, handy notation.
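
A minimal sketch of such an upper-layer notation, assuming (for illustration
only) that the equivalent characters are the ASCII graphic characters and
their full-width forms as coded in 10646. The helper names `equivalents` and
`equivalence_regex` are hypothetical; the idea is simply that each character
of a search string is expanded into a character class of its equivalents, so
nothing has to be canonicalized at the lower layers.

```python
import re

# Assumption: ASCII 0x21-0x7E and the full-width forms U+FF01-U+FF5E are
# treated as equivalent; the two ranges differ by a fixed offset of 0xFEE0.
FULLWIDTH_OFFSET = 0xFEE0


def equivalents(ch):
    """Return every character considered equivalent to ch (hypothetical helper)."""
    code = ord(ch)
    if 0x21 <= code <= 0x7E:
        return ch + chr(code + FULLWIDTH_OFFSET)
    if 0xFF01 <= code <= 0xFF5E:
        return chr(code - FULLWIDTH_OFFSET) + ch
    return ch


def equivalence_regex(text):
    """Expand each character of text into a class of its equivalents."""
    parts = []
    for ch in text:
        variants = equivalents(ch)
        if len(variants) > 1:
            parts.append("[" + "".join(re.escape(v) for v in variants) + "]")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


pattern = equivalence_regex("JIS")
samples = ("JIS X0208", "\uFF2A\uFF29\uFF33 X0208", "ISO 2022")
matches = [bool(pattern.search(s)) for s in samples]
print(matches)
```

The first two samples match (ASCII and full-width spellings of "JIS"), the
third does not. A real tool would take its equivalence table from whatever
the users of that tool agree on, not from this fixed offset.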

> > > I found it very illuminating that the June 1993 version of ECMA-35, to
> > > be proposed to ISO as a new edition of 2022, requires the lowest
> > > numbered of G0/G1/G2/G3 to be used when a character is present in
> > > multiple sets, *even if a higher numbered set  is already invoked
> > > and the lowest numbered set is not* (clause 7.5). This amounts to a
> > > version of uniqueness.
> > 
> > I don't think it is of any use.
> 
> I wasn't implying anything about its usefulness, but merely pointing out
> that ECMA people have apparently felt the need for such a mechanism,

I can't understand why something that is not useful is considered necessary
by ECMA. Is there any political reason? (Please reply by private mail for
political matters.)

Anyway,

> If the *sender* has a table, because it uses multiple character sets,
> this does not mean that the *receiver* has that table, too. 

as you can see with ISO-2022-JP, all the characters in 2022 can be encoded
with G0 only. So, unless you have some prior negotiation on a profile
of ISO 2022 such as EUC, uniqueness cannot be assured by the sender,
and if prior negotiation is assumed, you are also free to negotiate
the uniqueness itself.
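
The G0-only point can be checked with a short sketch, assuming Python's
built-in `iso2022_jp` codec as a stand-in for an ISO-2022-JP encoder. The
helper `uses_only_g0` is hypothetical and deliberately simplified: it only
checks that the stream is 7-bit and contains no SO/SI locking shifts.

```python
# The same letter taken from two different coded character sets.
ascii_a = "A".encode("iso2022_jp")       # 'A' from ASCII
jis_a = "\uFF21".encode("iso2022_jp")    # 'A' from JIS X 0208

SHIFT_OUT, SHIFT_IN = 0x0E, 0x0F


def uses_only_g0(data):
    """True if the stream is 7-bit and never invokes G1 with SO/SI."""
    return all(b < 0x80 and b not in (SHIFT_OUT, SHIFT_IN) for b in data)


print(ascii_a, uses_only_g0(ascii_a))
print(jis_a, uses_only_g0(jis_a))
```

Both byte strings pass the G0-only check, yet they differ: the second is
wrapped in escape sequences that designate JIS X 0208 to G0. A rule of the
form "use the lowest-numbered G set" therefore cannot choose between them.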

> > And, do you think 'A' in JIS X0208 is identical to 'A' in ASCII?
> > Do you think the Han characters of GB, CNS, JIS and KCS unified in
> > ISO 10646 are the same characters?
> 
> As you have very vocally pointed out on numerous mailing lists, there
> are apparently people that don't think the Han unification was a good
> idea. Since I know next to nothing about eastern languages, I have no
> opinion on the matter.

It does not matter whether you think Han unification is good or bad. But,
if a single character is labeled as

	JIS_X0208-1990_XX_YY

and

	CJK_UNIFIED_HAN_ZZZZ

it is just as bad as having both

	ASCII_CAPITAL_LETTER_A

and

	ISO_8859_CAPITAL_LETTER_A

Do these labels represent the same character?

> However, for ASCII this is a different case. I was under the impression
> that the ASCII "subsets" of GB, JIS, KCS, etc were in fact identical in
> meaning to the corresponding ASCII characters.

I also believe so, though there is no one-to-one correspondence between the
quotation mark of ASCII (one code shared for opening and closing) and the
quotation marks of JIS X0208 (different code points for opening and closing).

> Could you clarify why you
> believe this not to be the case?

The problem is that there are too many people who believe differently.
That's why there is table 122 of 10646 for full-width forms.
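
The consequence of the full-width forms can be seen in a short sketch,
assuming Python's `unicodedata` module, which reflects present-day Unicode
data aligned with 10646. The NFKC folding used here is a later Unicode
mechanism and is not something defined in the 1993 documents under
discussion; it is shown only to illustrate that the two 'A's are coded
separately and are unified only by an explicit extra step.

```python
import unicodedata

ascii_a = "A"             # U+0041
fullwidth_a = "\uFF21"    # the full-width form of 'A'

# The two characters carry different names and different code points.
names = (unicodedata.name(ascii_a), unicodedata.name(fullwidth_a))
same_code_point = ascii_a == fullwidth_a

# Only an explicit compatibility folding makes them compare equal.
same_after_folding = unicodedata.normalize("NFKC", fullwidth_a) == ascii_a

print(names, same_code_point, same_after_folding)
```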

						Masataka Ohta

Received on Saturday, 14 August 1993 18:52:50 UTC