RE: Thoughts about characters transmission from Masataka Ohta on 1993-07-20 (ietf-charsets@w3.org from July to September 1993)

From: Masataka Ohta <mohta@necom830.cc.titech.ac.jp>
Date: Tue, 20 Jul 1993 14:10:06 +0900 (JST)
To: lwj@cs.kun.nl
Cc: Guido.van.Rossum@cwi.nl, ietf-charsets@INNOSOFT.COM
Message-id: <9307200510.AA08531@necom830.cc.titech.ac.jp>
> Although I have no intention to start a religious war, I would like
> to point out some technical difficulties with Otha's proposed IUTF.

OK. Your comment is, at least, syntactically technical.

> Please refer to my earlier posting about NET-TEXT for an alternative
> proposal.

What is NET-TEXT?

> > BTW, I now think that, if we are to use almost raw UTF2 as interim encoding

> UTF-2 (as used by Plan 9, I don't have an X/Open reference) requires that
> the *shortest* sequence be used (although programs may not check it),

That's why I wrote "almost raw UTF2".

> thus this would make your coding incompatible with UTF-2.

Do you mind? UTF2 is not part of ISO 10646.

> a bit strange. Consider that T5 = 1111110x and the
> five following Tx bytes have only 30 bits available: there is no
> way to represent codes >= 2^31 (or maybe these don't occur
> in ISO 10646; please enlighten me if this is the case).

It was my proposal to make UCS4 31 bit. And the proposal was accepted by
ISO long ago. So, UCS4 is 31 bit.

The reason is that, with a 31-bit UCS, there is no difference between
signed and unsigned quantities, and users can use the MSB of a
32-bit word as an internal, user-defined flag. So 31 bits is quite
favourable for actual processing.
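
A minimal sketch of the point about the spare MSB, assuming a C99
compiler with <stdint.h>. The names UCS4_MASK, USER_FLAG, set_flag,
has_flag, code_of and same_signed_unsigned are hypothetical and only
illustrate the idea; what the flag means is up to the application.

```c
#include <stdint.h>

/* Hypothetical names, for illustration only. */
#define UCS4_MASK 0x7FFFFFFFu	/* the 31 bits that carry the code */
#define USER_FLAG 0x80000000u	/* the MSB, free for internal use */

/* Attach the internal flag to a 31-bit code. */
uint32_t set_flag(uint32_t code)
{
	return code | USER_FLAG;
}

/* Non-zero when the internal flag is set on a stored word. */
int has_flag(uint32_t word)
{
	return (word & USER_FLAG) != 0;
}

/* Recover the 31-bit code, dropping the flag. */
uint32_t code_of(uint32_t word)
{
	return word & UCS4_MASK;
}

/* A 31-bit code has the same value whether it is read as a signed
   or as an unsigned 32-bit quantity. Only call this with codes that
   fit in 31 bits. */
int same_signed_unsigned(uint32_t code)
{
	return (int64_t)(int32_t)code == (int64_t)code;
}
```

The flag never collides with a character code, because no 31-bit code
can have the MSB set.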

> >         A':33~46,48~126

> I don't see the reason for introducing A'; could you explain please?

To keep IUTF compatible with the intention of UTF2. Code point 47 is '/',
and the file system of unmodified, raw UNIX won't accept it in a file name.
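
A small sketch of the A' test, taking the ranges 33~46 and 48~126 from
the quoted definition. The function names is_a_prime and count_a_prime
are hypothetical.

```c
/* A' as quoted: 33..46 and 48..126, that is, printable ASCII
   without space (32) and without '/' (47). */
int is_a_prime(int c)
{
	return (c >= 33 && c <= 46) || (c >= 48 && c <= 126);
}

/* Count the members of A' by brute force over all octet values. */
int count_a_prime(void)
{
	int c, n = 0;

	for (c = 0; c < 256; c++)
		if (is_a_prime(c))
			n++;
	return n;
}
```

The count is 14 + 79 = 93. That agrees with the "T1 A'" row of the
quoted table, 2976 = 32 x 93, if T1 ranges over 32 values; the size of
T1 is an assumption here, since this message does not define it.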

> >         T1 A'                   2976
> >         T2 A'                   1488
> >         U1 A'                   1488
> >         U1 Tx                   1024
> >         T1 T2                   512
> >         T1 U1                   512
> >         U1 T2                   256
> >         S2 Tx A'                35712
> >         S3 Tx A' Tx             >2^21
> >         S4 Tx A' Tx Tx          >2^25
> >         S6 Tx A' Tx Tx Tx Tx    >2^36
> >         S7 Tx A' Tx Tx Tx Tx Tx >2^42
> 
> These sequences destroy the resynchronisation property: consider what
> happens if you hit an internal non-Tx byte: how would you know that it
> was internal? E.g. consider
> 
> 	T1 A'		and		T1 T2 A'
> 
> The "intended" parsing is
> 
> 	[T1 A']		and		[T1 T2] [A']

Sure. And, as 'T1' can appear only at the start of the IUTF sequence, that
is the only possible parsing.

> but you could also parse them as
> 
> 	... T1] [A']	and		... T1] [T2 A']
> 			and		... T1 T2] [A']

You can't. This is the simplest case. The octets which can terminate an
IUTF sequence are 'A' 'Tx' 'T2' and 'U1'. Not 'T1'.

And, you can disambiguate state even in complex cases by looking backward
at most 8 characters (for 42 bit encoding).

You can also disambiguate state by looking ahead at most 8 characters
(for 42 bit encoding).

> > Hash  tables  could be used for the fast translation from
> > ICODE to IUTF for such shorthand notations.
> 
> This seems a bit too complex for the purpose.

Are you sure? What is your purpose? Do you have any experience in
programming? It is quite simple.

	struct {wchar_t icode; char *iutf;} hasht[TABLESIZE], *hashp;

	char *icode2iutf(icode)
	wchar_t icode;
	{static char answer[7];	/* longest form is 6 octets, plus NUL */
+		for(hashp=&hasht[icode%TABLESIZE];hashp->iutf;)
+		{	if(hashp->icode==icode)
+				return hashp->iutf;
+			hashp++;
+			if(hashp==hasht+TABLESIZE)
+				hashp=hasht;
+		}
		/* do regular conversion of UTF2 */
		/* for ASCII */
-		if(icode<0x80)
-		{	answer[0]=icode;
-			answer[1]=0;
-			return answer;
-		}
		/* for Euro-centrism */
-		if(icode<0x800)
-		{	answer[0]=((icode>>6)&0x1f)|T1;	/* 110xxxxx */
-			answer[1]=(icode&0x3f)|0x80;
-			answer[2]=0;
-			return answer;
-		}
		/* for poor languages */
		if(icode<0x10000)
		{	answer[0]=((icode>>12)&0xf)|T2;	/* 1110xxxx */
			answer[1]=((icode>>6)&0x3f)|0x80;
			answer[2]=(icode&0x3f)|0x80;
			answer[3]=0;
			return answer;
		}
		.....

There are only 7 lines added (marked with '+'). Moreover, it is possible
to merge the conversion of ASCII characters (and any other frequently
used characters) into the table, which reduces both the code length (the
lines marked with '-' go away) and the processing time.
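
For reference, a self-contained sketch of the same idea in ANSI C: a
linear-probing table consulted first, with the regular UTF2 conversion
as the fallback. It is an illustration under stated assumptions, not
the IUTF definition: TABLESIZE = 1024 is arbitrary, add_shorthand,
lookup and utf2_encode are hypothetical helpers, and the shorthand
octets used in any test are made up. The six-octet form carries
1 + 5 x 6 = 31 bits, so larger codes are rejected.

```c
#include <stddef.h>

#define TABLESIZE 1024	/* assumed size; the message does not fix one */

struct entry { unsigned long icode; const char *iutf; };
static struct entry hasht[TABLESIZE];	/* iutf == NULL marks an empty slot */

/* Register a shorthand; returns 0 on success, -1 if the table is full. */
int add_shorthand(unsigned long icode, const char *iutf)
{
	size_t i = icode % TABLESIZE, n;

	for (n = 0; n < TABLESIZE; n++) {
		if (hasht[i].iutf == NULL || hasht[i].icode == icode) {
			hasht[i].icode = icode;
			hasht[i].iutf = iutf;
			return 0;
		}
		i = (i + 1) % TABLESIZE;	/* linear probing with wrap-around */
	}
	return -1;
}

/* Return the registered shorthand, or NULL when there is none. */
static const char *lookup(unsigned long icode)
{
	size_t i = icode % TABLESIZE, n;

	for (n = 0; n < TABLESIZE && hasht[i].iutf != NULL; n++) {
		if (hasht[i].icode == icode)
			return hasht[i].iutf;
		i = (i + 1) % TABLESIZE;
	}
	return NULL;
}

/* Regular UTF2 conversion, 1 to 6 octets; NULL for codes over 31 bits. */
static char *utf2_encode(unsigned long c, char *out)
{
	static const unsigned long limit[6] = {
		0x80UL, 0x800UL, 0x10000UL, 0x200000UL, 0x4000000UL, 0x80000000UL
	};
	static const unsigned char lead[6] = {
		0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC
	};
	int len, i;

	for (len = 1; len <= 6; len++)
		if (c < limit[len - 1])
			break;
	if (len > 6)
		return NULL;
	out[len] = '\0';
	for (i = len - 1; i > 0; i--) {	/* trailing octets, low bits first */
		out[i] = (char)(0x80 | (c & 0x3F));
		c >>= 6;
	}
	out[0] = (char)(lead[len - 1] | c);
	return out;
}

/* Table first, regular conversion second. The result of the regular
   conversion lives in a static buffer and is overwritten by the next call. */
const char *icode2iutf(unsigned long icode)
{
	static char answer[7];	/* 6 octets plus NUL */
	const char *hit = lookup(icode);

	return hit ? hit : utf2_encode(icode, answer);
}
```

Registering ASCII and other frequent characters with add_shorthand
would let the table answer them directly, which is the merge described
above.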

Further minor improvements are possible in the generation of the initial
hashp value, to avoid modulo operations and such.
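
One common way to do that, sketched here as an assumption rather than
as the intended improvement, is to make TABLESIZE a power of two so
that the initial index is a single AND. TABLEBITS and hash_index are
hypothetical names, and 14 bits is an arbitrary choice.

```c
#define TABLEBITS 14			/* assumed: 16384 slots */
#define TABLESIZE (1UL << TABLEBITS)

/* Initial probe index. Because TABLESIZE is a power of two, this
   gives the same result as icode % TABLESIZE without a division. */
unsigned long hash_index(unsigned long icode)
{
	return icode & (TABLESIZE - 1);
}
```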

The initialization data for the table covers about 10,000 characters
(two-octet forms only) and will be 64KB or so. It would be less than
1MB even if all three-octet forms were assigned.

If you are serious about internationalization, such an amount of irregular
data is negligible.

And that is for the input conversion of European characters only. For the
input of Japanese, you need about 1MB of dictionary. For the output of
characters, you of course need tens of megabytes of font image data.

And, if your purpose is not serious and you want to process ASCII and some
other characters only, you don't have to worry about the conversion and you
don't have to have the conversion table, because you won't need the
conversion.

						Masataka Ohta

Received on Tuesday, 20 July 1993 16:39:54 UTC