RE: General policy from Luc Rooijakkers on 1993-08-02 (ietf-charsets@w3.org from July to September 1993)

From: Luc Rooijakkers <lwj@cs.kun.nl>
Date: Mon, 02 Aug 1993 22:05:23 +0100
To: Harald Tveit Alvestrand <harald.t.alvestrand@delab.sintef.no>
Cc: ietf-charsets@INNOSOFT.COM
Message-id: <9308022105.AA01620@opus.spc.nl>

Otha writes:

> > So, has everybody on this list agreed that
> >  
> >         we should provide a single universal encoding of text usable
> > 	by (almost) all existing protocols so that we do not have to
> > 	extend all the protocols

I think this is a worthwhile goal.

Harald writes:

> - We should keep our minds open for, and expect to see within the next
>   10 years, a single standard blessed by ISO that has all the properties
>   that we desire, and should be adopted by us.

This means that we should stay out of areas that may be touched by ISO
(in terms of encoding space), even so-called "reserved for private use"
areas, since ISO seems to have a habit of retracting such reservations later.
It follows that whatever encoding we agree on should keep the "UCS" space
totally separated from the "extended" space.

> - We should do whatever we need to do to get things to work in the meantime.

This is of course ground for lots of debate, since everybody seems to
have a different definition of "working".

> I've got an idea that this requires our protocols to do character set
> *labelling*, and that character set *switching* may not be required,
> since there should be only approximately 4 things to label:

Are you thinking about labeling at the character level or at some
higher level like ISO 2022 does? The former gives a stateless encoding
and keeps the advantages of character set switching a la 2022
(in particular, easy extensibility). Its one drawback is code size,
and that can be compensated somewhat by careful design of the encoding.
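
To illustrate the difference, here is a minimal sketch under stated
assumptions. The label octets, the ESC convention and both decoder names
are hypothetical and not taken from ISO 2022 or any other standard; the
point is only that a character-level label lets any aligned fragment be
decoded on its own, at the cost of a longer code, while a switched
encoding depends on state set earlier in the stream.

```python
# Hypothetical label octets, invented for this illustration only.
LABELS = {0x01: "ascii", 0x02: "latin-1"}
ESC = 0x1B  # hypothetical "switch charset" marker for the stateful variant


def decode_labelled(data):
    """Character-level labeling: every character is a (label, code) pair."""
    out = []
    for i in range(0, len(data), 2):
        label, code = data[i], data[i + 1]
        out.append(bytes([code]).decode(LABELS[label]))
    return "".join(out)


def decode_switched(data, charset="ascii"):
    """Higher-level labeling: ESC + label switches the set for what follows."""
    out = []
    i = 0
    while i < len(data):
        if data[i] == ESC:
            charset = LABELS[data[i + 1]]  # state carried across characters
            i += 2
        else:
            out.append(bytes([data[i]]).decode(charset))
            i += 1
    return "".join(out)


labelled = bytes([0x01, ord("a"), 0x02, 0xE9])
switched = bytes([ESC, 0x02, 0xE9, 0xE9])

print(decode_labelled(labelled))      # whole stream
print(decode_labelled(labelled[2:]))  # a fragment still decodes by itself
print(decode_switched(switched))      # works only when read from the start
```

Decoding `switched[3:]` with the default state fails, because the octet
0xE9 is not valid in the initial set; the labelled form has no such
dependency but spends two octets per character.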

> - US-ASCII
> - ISO 8859-1 (and other temporary, traditional means like 2022-jp)
> - Our 10-year hack
> - The "Final Solution".

Since the "Final Solution" will presumably be a variant of 10646,
there is no need to separate US-ASCII and 8859-X from the others, since
all of these are part of 10646. 

I would thus phrase your list differently:

- ISO 10646
- any ISO-registered character set

Of course, we should define shortcuts for particular sets in order to
shorten the coding sequences.

Note that it is important to require that characters from some subset of
10646 *always* be coded as 10646 (for example, everything in 8859-X),
although you can debate *which* subset.

> > 	be for plain text processing
> If you mean "be able to represent plain text, but we should ignore the
> issues of underlining, emphasis, font size and so on", I agree.

Do you mean "should *not* ignore"?

> > 	be ASCII compatible
> If you mean "be able to represent US-ASCII as a proper subset", I agree.
> I'm not sure that the requirement to let US-ASCII text be legal text in
> the encoding is a necessary requirement.

There are different kinds of compatibility here.
In order of increasing usefulness:

   (1) all characters from US-ASCII are included
		
   (2) pure US-ASCII documents need no conversion

   (3) octets < 128 occur only when representing US-ASCII
   
A solution must have property (1) to be at all useful, so there is no
point in arguing it.

Property (2) is desirable, because it means that to some extent old
implementations will be able to interoperate with new ones
(no mode switch is needed for new receivers to support old senders).

Property (3) is also desirable, because it means that some old
implementations may in fact do the correct thing when presented with
new input (consider DNS or most other 8-bit clean implementations).

Note that (3) can be weakened somewhat by restricting the class of
self-representing octets.

UTF-1 has properties (1), (2) and a weakened version of (3);
UTF-2 has all three properties.
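
Properties (2) and (3) can be checked mechanically for a given encoding.
The following is a rough sketch, not a definitive test. It assumes
Python's built-in "utf-8" codec as a stand-in for UTF-2 (the format was
later standardized under the name UTF-8), and the two helper names are
hypothetical. UTF-16 is included only as a counterexample.

```python
def has_property_2(ascii_text, codec):
    """Property (2): a pure US-ASCII document needs no conversion."""
    return ascii_text.encode(codec) == ascii_text.encode("ascii")


def has_property_3(text, codec):
    """Property (3): octets < 128 occur only when representing US-ASCII."""
    for ch in text:
        if ord(ch) >= 128 and any(octet < 128 for octet in ch.encode(codec)):
            return False
    return True


# Sample mixing US-ASCII, a Latin-1 letter and three CJK characters.
sample = "na\u00efve \u65e5\u672c\u8a9e"

print(has_property_2("plain ASCII", "utf-8"))
print(has_property_3(sample, "utf-8"))
print(has_property_3(sample, "utf-16-be"))
```

The check for property (3) only samples the characters it is given, so a
True result is evidence for the sample and not a proof for the encoding.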

> > 	be universal
> If you mean "be able to encode all known and tabulated writing systems,
> and be extensible to cover new ones as they are tabulated", yes.

The easiest answer to this is to use the 2022 labeling in some form,
because that makes new coded character sets accessible from the moment
they are ISO-registered.

> > 	satisfy causality
> Causality = no 2 glyphs are represented by the same octet string sequence.
> (The non-unification requirement) (watch out for meaning of the word "glyph")

I believe what Otha meant was: no octet sequence representing a valid
glyph is a prefix of an octet sequence representing a different glyph.
What this means is that it is possible to recognize complete octet
sequences without requiring lookahead (which is impossible for
interactive applications).

There are of course various ways to strengthen this property, such as
requiring that the length of an octet sequence is implied by the first
octet (which UTF-2 satisfies, by the way).
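
Both forms of the property can be sketched in a few lines. This is an
illustration under assumptions: the function names are hypothetical, and
the first-octet table follows the bit layout of present-day UTF-8
(limited to four octets), used here as a stand-in for UTF-2.

```python
def is_prefix_free(codes):
    """True if no octet sequence is a proper prefix of a different one."""
    return not any(
        a != b and b.startswith(a) for a in codes for b in codes
    )


def length_from_first_octet(first):
    """Stronger form: the first octet alone implies the sequence length."""
    if first < 0x80:
        return 1
    if 0xC0 <= first < 0xE0:
        return 2
    if 0xE0 <= first < 0xF0:
        return 3
    if 0xF0 <= first < 0xF8:
        return 4
    raise ValueError("octet cannot start a sequence")


chars = ["A", "\u00e9", "\u20ac", "\U0001f600"]
codes = [c.encode("utf-8") for c in chars]

print(is_prefix_free(codes))
print(all(length_from_first_octet(code[0]) == len(code) for code in codes))

# A made-up code set that violates the property: after reading 0x1B a
# receiver cannot tell whether the sequence is complete without lookahead.
print(is_prefix_free([b"\x1b", b"\x1b\x24"]))
```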

> > 	have finitestateness
> Finitestateness - all glyph sequences generatable by the encoding can
> be enumerated. Note the possible conflict with "universal". Yes.

Agreed.

> > 	is finitely resynchronizable
> Yes.

Agreed.

Otha, what happened to your "uniqueness" constraint, i.e., the
requirement that some class of characters have only one representation?
Although not in general achievable, it is *very* useful for a restricted 
class, e.g. all of 8859-X.

--
Luc Rooijakkers                                 Internet: lwj@cs.kun.nl
SPC Company, the Netherlands                    UUCP: uunet!cs.kun.nl!lwj


Received on Monday, 2 August 1993 14:30:32 UTC