Re: rfc3987bis and RFC 6365 from John C Klensin on 2012-06-08 (public-iri@w3.org from June 2012)

From: John C Klensin <john-ietf@jck.com>
Date: Fri, 08 Jun 2012 03:29:47 -0400
To: Peter Saint-Andre <stpeter@stpeter.im>, public-iri@w3.org
Message-ID: <90505446CE0DCCDE63978907@JcK-HP8200.jck.com>
--On Thursday, June 07, 2012 13:47 -0600 Peter Saint-Andre
<stpeter@stpeter.im> wrote:

>> 4. Strangely, RFC 6365 does not define "UCS", so I suppose
>> it's OK to define that here.

I can't speak for Paul's reasoning because I don't think we
discussed it explicitly, but omitting it from 6365 was
deliberate on my part.  There were two reasons.  The first is
that "universal character set" is itself ambiguous as to whether
it refers to the Unicode/10646 code set or some other attempt.
One can define that problem away if one assumes that the readers
will carefully refer to the definitions even when they think
they know what a term means (my experience indicates that rarely
happens but YMMV).  Second and far more important, I think we do
ourselves and our audience no favors by using essentially
synonymous terms interchangeably to refer to the same thing.  It
does not help with understanding and may cause confusion.  The
practice at the time RFC 2277 was written was to call that thing
"ISO 10646" (not correct when 2277 was written, but see below).
Once we discovered (more or less around the time RFCs 3454 and
3490 were coming together) that we had clear requirements for
property tables (and at the time, encodings) that were not part
of ISO/IEC 10646 itself, the practice shifted toward calling
that thing "Unicode".  We've gotten most of the community used
to seeing those two terms as mostly interchangeable and being
clear about the distinction when it is important.   Introducing
"UCS" to the mix adds no value and risks reopening the mini-flap
about our combining "character repertoire", "code set" (or
"CCS"), and "encoding" into "charset" in RFC 2277 (and, earlier,
RFC 1341 and its successors).


(Massive nit-pick follows, but these things actually are
important if one wants a clear and useful definition)

I don't believe 3987bis should define "UCS"; I believe it should
get rid of the term entirely even if that means rewriting some
sentences rather than just performing string substitution.  As
an example of the desirability of doing this, please read the
first paragraph of Section 2.1 [draft-ietf-iri-3987bis-11].
First, despite the earlier definition and the use of "Universal
Character Set" in the Abstract [1], it notes "Universal Character
Set" in parentheses, and then cites [ISO10646].  The intervening
comma implies that those are two separate definitions, adding to
the potential confusion.   Second, this definition (and the
other definitions, see [1] below) appears to pretend that
Unicode and ISO/IEC 10646 are the same, which they are not.  RFC
6365 was extremely careful about the relationship, which is
another reason to use it rather than defining new terms.

There is an incidental problem about what "primarily" means in
the key sentence.   There doesn't seem to be any nearby
explanation.  If there isn't one, it should be dropped.

Recommendation:  In Section 2.1, 

Old:
	The IRI syntax extends the URI syntax in [RFC3986] by
	extending the class of unreserved characters, primarily
	by adding the characters of the UCS (Universal Character
	Set, [ISO10646]) beyond U+007F, subject...

New:
	The IRI syntax extends the URI syntax in [RFC3986] by
	extending the class of unreserved characters by adding
	the characters (code points) of ISO/IEC 10646
	[ISO10646] outside the ASCII repertoire, subject...

No "primarily", no expecting the user to either know all this
stuff (in which case a large chunk of this document would be
unnecessary) or to go running off to figure out what "U+007F"
means in this context, etc.  If one doesn't find
citation-as-object offensive (the editors of this document
apparently do not) and also notes that "URI syntax" is used
without a citation in Section 1.2 (perhaps there should be a
citation there, but if it is there it is not needed here, and if
it is not needed there, then it isn't needed here either), the
above can be further shortened to 

New (short version):
	The IRI syntax extends the URI syntax by extending the
	class of unreserved characters by adding the characters
	(code points) of [ISO10646] outside the ASCII
	repertoire, subject...
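
As a rough illustration of what "outside the ASCII repertoire"
(equivalently, "beyond U+007F") means in terms of code points, here
is a minimal Python sketch.  The function names are hypothetical and
chosen only for this example; the sketch does not model the further
restrictions that the draft text introduces after "subject...".

```python
ASCII_MAX = 0x7F  # U+007F, the last code point of the ASCII repertoire


def is_outside_ascii(ch: str) -> bool:
    """Return True if the single character ch lies beyond U+007F."""
    return ord(ch) > ASCII_MAX


def non_ascii_code_points(text: str) -> list[str]:
    """List the code points in text outside ASCII, in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text if is_outside_ascii(ch)]


# Illustrative input only: two occurrences of U+00E9 in the path.
print(non_ascii_code_points("http://example.org/r\u00e9sum\u00e9"))
# prints ['U+00E9', 'U+00E9']
```

Everything at or below U+007F is already covered by the URI syntax,
so only the code points reported here are candidates for the
extended class of unreserved characters.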


In spot-checking the document further, I realized that "plain
text" is actually not defined anywhere.  If it is going to be
used at all, I think it deserves a definition or citation.
Peter's comment that most of its uses should actually be
"running text" still applies.

   ------------------
		
[1] Adding to the mess, the usage of "Universal Character Set"
in the Abstract is followed by a citation of Unicode _and_ ISO
10646 (note that the latter is simply wrong, even though in
popular use in the IETF), one that tries to avoid the RFC
Editor's "no citations in Abstracts" rule by changing the
brackets to parens (something that should never fly).  But it
effectively leaves us with three nearly-identical definitions:
(i) "Universal Character Set" in the Abstract, defined by
reference as "(Unicode/ISO 10646)", (ii) the actual definition
in Section 1.3, which is of UCS, not "Universal Character Set",
defined as (behold!) "Universal Character Set" which is then
defined as "ISO/IEC 10646" (correctly, not "ISO 10646") "and the
Unicode Standard", and (iii) the new inline definition in
Section 2.1.


best,
    john
Received on Friday, 8 June 2012 07:31:58 UTC