- From: Luc Rooijakkers <luc@opus.spc.nl>
- Date: Fri, 06 Aug 1993 18:53:56 +0100
- To: Masataka Ohta <mohta@necom830.cc.titech.ac.jp>
- Cc: Luc Rooijakkers <lwj@cs.kun.nl>, ietf-charsets@INNOSOFT.COM
> > > It is very useful, if the uniqueness is not achievable, to have
> > > some short notation of regular expressions to represent all the
> > > equivalent characters.
> >
> > This merely moves the burden up to the user, to type that regular
> > expression,
>
> That's why a *SHORT* notation is very useful.

I still don't see the advantages of this over canonicalization at lower
layers, for the class of characters that you want to be canonicalized.
Also, since the sender presumably knows the character sets it is using,
it has a much better chance of having the appropriate information than
the receiver.

> > I found it very illuminating that the June 1993 version of ECMA-35, to
> > be proposed to ISO as a new edition of 2022, requires the lowest
> > numbered of G0/G1/G2/G3 to be used when a character is present in
> > multiple sets, *even if a higher numbered set is already invoked
> > and the lowest numbered set is not* (clause 7.5). This amounts to a
> > version of uniqueness.
>
> I don't think it any useful.

I wasn't implying anything about its usefulness, but merely pointing out
that the ECMA people have apparently felt the need for such a mechanism,
which means that we don't need to feel bad about it, should we feel the
need. Of course, this is second-guessing their intentions; it would be
useful if someone could dig up the reasons for this change.

> As the code points of some character varies with no regularity in
> different character sets, it means you must have a table. And, if
> you have such a table, it is not at all difficult for the receiver
> side to use the table to disambiguate a character with multiple
> representations.

If the *sender* has a table, because it uses multiple character sets,
this does not mean that the *receiver* has that table, too.

> And, do you think 'A' in JIS X0208 is identical to 'A' in ASCII?
> Do you think Han characters of GB, CNS, JIS, KCS unified in ISO 10646
> the same characters?

As you have very vocally pointed out on numerous mailing lists, there
are apparently people who don't think the Han unification was a good
idea. Since I know next to nothing about eastern languages, I have no
opinion on the matter.

For ASCII, however, the case is different. I was under the impression
that the ASCII "subsets" of GB, JIS, KCS, etc. were in fact identical in
meaning to the corresponding ASCII characters. Could you clarify why you
believe this not to be the case?

--
Luc Rooijakkers                 Internet: lwj@cs.kun.nl
SPC Company, the Netherlands    UUCP: uunet!cs.kun.nl!lwj
Received on Friday, 6 August 1993 10:43:57 UTC