Re: Registration of new charset SCSU

Kenneth Whistler wrote:
> > Additional Information:
> >     SCSU is a Character Encoding Scheme for Unicode/ISO 10646
> >     that allows significant size reduction of text compared
> >     to UCS Transformation formats.
> 
> SCSU does not meet the definition of a "Character Encoding Scheme"
> as defined in Unicode Technical Report #17, Character Encoding
> Model. UTR #17 states:
> 
> "SCSU ... should also be conceived of as [a] transfer encoding
> syntax. [It is] not appropriate for registration to get [a]
> formal charset identifier, since the compressed forms are not
> unique, but depend on the sophistication of the compression
> algorithm."

I missed this part of UTR #17. I am reading it now.

UTR #17 defines:

"A transfer encoding syntax is a reversible transform of
encoded data which may (or may not) include textual data
represented in one or more character encoding schemes."

(Emphasis in the Report on the first instance of "data".)

It then lists as examples base64, uuencode, gzip, and others.

On the other hand, SCSU is designed specifically for Unicode text,
not general data. The aspect of compression does not, according
to the rules of IANA registration and unlike stated in UTR #17,
preclude the use as a charset.

SCSU defines encoding choices for Unicode code points. It does offer
choices, but it essentially encodes Unicode text.

Again, from UTR #17:

"A character encoding scheme is a mapping of code units
into serialized byte sequences."

It is not a one-to-one mapping, but it is a well-defined, reversible
mapping - unless you define "mapping" very closely.
See also the charset examples below.

In my opinion, SCSU is closer to a CES than a TES.
At best (worst?), SCSU is in a gray zone between the CES and
TES definitions, or it is both.

> I don't know whether this would absolutely disallow a charset
> registration for SCSU, but we should be clear that while SCSU
> *de*coding is uniquely defined, SCSU *en*coding is not.

While this is true, SCSU does fit the definition of a charset:
"... a method of converting a sequence of octets into
a sequence of characters. ..."
(From the draft registration procedure, section 2.3 "Charset".)

This does not require uniqueness. Also, other charsets do not provide uniqueness:

- ISO-2022 allows the same character encoded in as many different ways as
  are accessible with different embedded charsets.
  ISO-2022 is also stateful.

- EUC-TW duplicates the entire plane 1 of CNS 11643-1992 in
  Code set 1 (2 bytes/char) and Code set 2 (4 bytes/char).
  (Reference: Ken Lunde's "CJKV Information Processing", pp. 162-164)

I find SCSU highly useful for something like XML.

> --Ken

markus

Received on Tuesday, 13 June 2000 15:34:08 UTC