Proposal for insensitive registration from Mark Davis on 2002-07-23 (ietf-charsets@w3.org from July to September 2002)

From: Mark Davis <mark.davis@us.ibm.com>
Date: Mon, 22 Jul 2002 18:36:26 -0700
To: Markus Scherer <markus.scherer@jtcsv.com>
Cc: charsets <ietf-charsets@iana.org>
Message-id: <OF1C0D53EA.DC1812F8-ON88256BFF.00028FD5@us.ibm.com>


I have a couple of notes, then a suggested proposal.

1. I ran a check, doing the following things:
- uppercasing each string
- removing all characters except A-Z and 0-9
- removing all leading zeros (zeros not preceded by a digit)

I then checked for collisions, where two different names (or aliases for
different names) matched under these circumstances. The results are that
there are only 2 collisions:

Collision between: iso-ir-91 (JIS_C6229-1984-a) and iso-ir-9-1 (NATS-DANO)
Collision between: iso-ir-92 (JIS_C6229-1984-b) and iso-ir-9-2 (NATS-DANO-ADD)

Both of these (looking at http://www.itscj.ipsj.or.jp/ISO-IR/overview.htm)
are very old code pages, not in common use, so they can be grandfathered
in.
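
A minimal sketch of the check described above, in Python. The function
names and the registry layout (a dict mapping each registered name to its
aliases) are assumptions made for illustration. The sketch also reads
"leading zeros" as any run of zeros that does not directly follow a digit,
which is one possible interpretation.

```python
import re
from collections import defaultdict


def normalize(name):
    """Fold a charset name: uppercase, keep A-Z and 0-9, drop leading zeros."""
    folded = re.sub(r"[^A-Z0-9]", "", name.upper())
    # Assumption: a "leading zero" is a run of zeros not directly after a digit.
    return re.sub(r"(?<![0-9])0+", "", folded)


def find_collisions(registry):
    """Return {folded key: [names]} where different names fold to the same key.

    `registry` is a hypothetical structure: {registered name: [aliases]}.
    """
    owners = defaultdict(set)
    for primary, aliases in registry.items():
        for label in [primary, *aliases]:
            owners[normalize(label)].add(primary)
    return {key: sorted(names) for key, names in owners.items() if len(names) > 1}
```

With a registry containing JIS_C6229-1984-a (alias iso-ir-91) and NATS-DANO
(alias iso-ir-9-1), both aliases fold to ISOIR91 and are reported as one
collision. Aliases of the same name never collide with each other.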


2. The characters actually in use are A-Z, 0-9, plus:

'_', example: ANSI_X3.110-1983
'.', example: ANSI_X3.110-1983
'-', example: ANSI_X3.110-1983
':', example: ISO_5427:1981
'+', example: PC-MULTILINGUAL-850+EURO
'(', example: NF_Z_62-010_(1973)
')', example: NF_Z_62-010_(1973)

Notice that the last two are in violation of
http://www.ietf.org/rfc/rfc2978.txt, and should be removed!


3. As Markus said, there should be a limit on the length of names, and
while http://www.iana.org/assignments/character-sets gives one, there are
violations in that very file:

Name >40: Extended_UNIX_Code_Fixed_Width_for_Japanese
Name >40: Extended_UNIX_Code_Packed_Format_for_Japanese
Maximum Length 45.

So the name limit should be extended to accommodate those.


4. The file in http://www.iana.org/assignments/character-sets is rather
clumsy to parse.

a. One has to key off of "-------------" at the start of line to know when
to start parsing, and "REFERENCES" to know when to stop. (And if these are
not invariants, then parsers may have to change over time!).
b. The exact format of the file is not described.
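
To illustrate how brittle (a) is, a parser today might look roughly like
the Python sketch below. The function name is hypothetical, and the sketch
assumes the two markers appear exactly as described above, at the start of
a line.

```python
def extract_data_lines(text):
    """Return the lines between the dashed rule and the REFERENCES heading."""
    lines = text.splitlines()
    start = None
    for i, line in enumerate(lines):
        if start is None:
            # Assumed start marker: a line beginning with a run of dashes.
            if line.startswith("-------------"):
                start = i
        elif line.startswith("REFERENCES"):
            return lines[start + 1 : i]
    # If either marker is ever reworded, the parser can only give up.
    raise ValueError("expected markers not found; has the file layout changed?")
```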


5. So I'd like to sum up the results of this discussion with a concrete
proposal. In http://www.iana.org/assignments/character-sets,

A. Replace the text:

The character set names may be up to 40 characters taken from the
printable characters of US-ASCII.  However, no distinction is made
between use of upper and lower case letters.

with the new text:

Constraints on Registered Names and Aliases

The character set names may be up to 45 characters taken from the printable
characters of US-ASCII.  As per RFC 2978 no distinction is made between use
of upper and lower case letters. While more punctuation characters are
permitted by RFC 2978, only the following should be used:
0x2B    '+'     PLUS SIGN
0x2D    '-'     HYPHEN-MINUS
0x2E    '.'     FULL STOP
0x3A    ':'     COLON
0x5F    '_'     LOW LINE

In addition, two strings are considered to conflict if, after uppercasing
them, then removing all characters except A-Z and 0-9, and then removing
all leading zeros (zeros not preceded by a digit), the resulting strings are
identical.
No new names or aliases will be accepted for registration that conflict
with existing names or aliases, except where they only conflict with
aliases for the same name. For example, "IBM-037" is acceptable as an alias
for "IBM037", but "roman08" is not acceptable as an alias for "macintosh"
because it would conflict with "roman8", which is an existing alias for
"hp-roman8".

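The constraints in A could be checked mechanically at registration time. The
Python sketch below is one way to do that, under stated assumptions: the
function names and the registry layout (a dict mapping each registered name
to its aliases) are hypothetical, and "leading zeros" is read as any run of
zeros that does not directly follow a digit.

```python
import re

# Letters, digits, and the five recommended punctuation characters, max 45.
ALLOWED = re.compile(r"[A-Za-z0-9+\-.:_]{1,45}")


def fold(name):
    """Uppercase, keep only A-Z and 0-9, then drop leading zeros."""
    kept = re.sub(r"[^A-Z0-9]", "", name.upper())
    return re.sub(r"(?<![0-9])0+", "", kept)


def can_register(candidate, target, registry):
    """Return True if `candidate` may be added as a name or alias of `target`.

    `registry` is a hypothetical structure: {registered name: [aliases]}.
    """
    if not ALLOWED.fullmatch(candidate):
        return False
    key = fold(candidate)
    for primary, aliases in registry.items():
        if primary == target:
            continue  # conflicts with labels of the same name are permitted
        if any(fold(label) == key for label in [primary, *aliases]):
            return False
    return True
```

Given a registry holding "IBM037", "macintosh", and "hp-roman8" with alias
"roman8", the sketch accepts "IBM-037" for "IBM037" and rejects "roman08" for
"macintosh", matching the two examples in the text. It also rejects any label
containing parentheses.
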
B. Start the data with "@START_DATA" and end it with "@END_DATA". Add
documentation in the header:

This file is designed to be machine-readable. The data starts with the line
"@START_DATA" and ends with the line "@END_DATA". Each line of data is of
the form:
  <tag> ":" <space> value1 <space>+ value2 <space>+ value3
or is a continuation line, starting with <space>. The values are
interpreted according to the tags, as follows:

Tag      Values
Name:    value1 is the name
         value2 is either blank, "(preferred MIME name)" or "[" <reference> "]"
         value3 is either blank, or "[" <reference> "]"
Alias:   value1 is the alias
         value2 is either blank, or "(preferred MIME name)"
MIBenum: value1 is a number, described above
Source:  value1 is descriptive text. This is the only entry that can have
         continuation lines.
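
A reader for the format proposed in B might look roughly like the Python
sketch below. It is only an illustration: the function name, the record
layout, and the choice to start a new record at each "Name:" line are
assumptions, not part of the proposal.

```python
import re

PREFERRED = "(preferred MIME name)"


def parse_registry(text):
    """Parse the lines between @START_DATA and @END_DATA into records."""
    records, current, last_tag, in_data = [], None, None, False
    for line in text.splitlines():
        if line.strip() == "@START_DATA":
            in_data = True
            continue
        if line.strip() == "@END_DATA":
            break
        if not in_data or not line.strip():
            continue
        if line[0] == " ":
            # Continuation line: only Source entries may have them.
            if current is not None and last_tag == "Source":
                current["source"] = (current["source"] + " " + line.strip()).strip()
            continue
        tag, _, rest = line.partition(":")  # names may contain ':' themselves
        rest = rest.strip()
        value1, _, extra = rest.partition(" ")
        if tag == "Name":
            current = {
                "name": value1,
                "aliases": [],
                "mibenum": None,
                "source": "",
                "preferred": PREFERRED in extra and value1 or None,
                "references": re.findall(r"\[([^\]]+)\]", extra),
            }
            records.append(current)
        elif current is None:
            continue  # data before the first Name line is ignored
        elif tag == "Alias":
            current["aliases"].append(value1)
            if PREFERRED in extra:
                current["preferred"] = value1
        elif tag == "MIBenum":
            current["mibenum"] = int(value1)
        elif tag == "Source":
            current["source"] = rest
        last_tag = tag
    return records
```

Because the two sentinel lines are fixed, nothing outside them needs to be
understood by the parser, and the header text can change freely.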

C. Remove the alias: NF_Z_62-010_(1973)

Mark
___
mark.davis@us.ibm.com
IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
(408) 256-3148
fax: (408) 256-0799
Received on Monday, 22 July 2002 21:37:21 UTC