- From: John C Klensin <john-ietf@jck.com>
- Date: Fri, 08 Jun 2012 03:29:47 -0400
- To: Peter Saint-Andre <stpeter@stpeter.im>, public-iri@w3.org
--On Thursday, June 07, 2012 13:47 -0600 Peter Saint-Andre <stpeter@stpeter.im> wrote:

>> 4. Strangely, RFC 6365 does not define "UCS", so I suppose
>> it's OK to define that here.

I can't speak for Paul's reasoning because I don't think we discussed it explicitly, but omitting it from 6365 was deliberate on my part. There were two reasons.

The first is that "universal character set" is itself ambiguous as to whether it refers to the Unicode/10646 code set or some other attempt. One can define that problem away if one assumes that the readers will carefully refer to the definitions even when they think they know what a term means (my experience indicates that rarely happens but YMMD).

Second and far more important, I think we do ourselves and our audience no favors by using essentially synonymous terms interchangeably to refer to the same thing. It does not help with understanding and may cause confusion. The practice at the time RFC 2277 was written was to call that thing "ISO 10646" (not correct when 2277 was written, but see below). Once we discovered (more or less around the time RFCs 3454 and 3490 were coming together) that we had clear requirements for property tables (and, at the time, encodings) that were not part of ISO/IEC 10646 itself, the practice shifted toward calling that thing "Unicode". We've gotten most of the community used to seeing those two terms as mostly interchangeable and being clear about the distinction when it is important. Introducing "UCS" to the mix adds no value and risks reopening the mini-flap about our combining "character repertoire", "code set" (or "CCS"), and "encoding" into "charset" in RFC 2277 (and, earlier, RFC 1341 and its successors).
(Massive nit-pick follows, but these things actually are important if one wants a clear and useful definition)

I don't believe 3987bis should define "UCS"; I believe it should get rid of the term entirely even if that means rewriting some sentences rather than just performing string substitution.

As an example of the desirability of doing this, please read the first paragraph of Section 2.1 [draft-ietf-iri-3987bis-11]. First, despite the earlier definition and the use of "Universal Character Set" in the Abstract [1], it notes "Universal Character Set" in parentheses, and then cites [ISO10646]. The intervening comma implies that those are two separate definitions, adding to the potential confusion. Second, this definition (and the other definitions, see [1] below) appears to pretend that Unicode and ISO/IEC 10646 are the same, which they are not. RFC 6365 was extremely careful about the relationship, which is another reason to use it rather than defining new terms.

There is an incidental problem about what "primarily" means in the key sentence. There doesn't seem to be any nearby explanation. If there isn't one, the word should be dropped.

Recommendation: In Section 2.1,

Old:
   The IRI syntax extends the URI syntax in [RFC3986] by extending the class of unreserved characters, primarily by adding the characters of the UCS (Universal Character Set, [ISO10646]) beyond U+007F, subject...

New:
   The IRI syntax extends the URI syntax in [RFC3986] by extending the class of unreserved characters by adding the characters (code points) of ISO/IEC 10646 [ISO10646] outside the ASCII repertoire, subject...

No "primarily", and no expecting the user either to know all this stuff (in which case a large chunk of this document would be unnecessary) or to go running off to figure out what "U+007F" means in this context, etc.
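To make "outside the ASCII repertoire" concrete, here is a minimal sketch in Python. It assumes only that the ASCII repertoire ends at code point U+007F; the function names are hypothetical, and the sketch deliberately ignores whatever further restrictions the elided "subject..." clause goes on to impose.

```python
# Illustrative sketch only: the names below are hypothetical, and the
# additional restrictions behind the elided "subject..." clause are ignored.

ASCII_MAX = 0x7F  # U+007F, the last code point of the ASCII repertoire


def is_outside_ascii(ch: str) -> bool:
    """Return True if the single character ch has a code point above U+007F."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    return ord(ch) > ASCII_MAX


def non_ascii_code_points(text: str) -> list[str]:
    """List, in U+XXXX notation, the code points in text that lie outside ASCII."""
    return [f"U+{ord(ch):04X}" for ch in text if is_outside_ascii(ch)]


if __name__ == "__main__":
    # "r\u00e9sum\u00e9" spells "resume" with two precomposed e-acute characters.
    print(non_ascii_code_points("r\u00e9sum\u00e9"))
```

The comparison is on code points, not on bytes of any particular encoding, which is the distinction the "characters (code points)" wording in the New text is meant to keep visible.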
If one doesn't find citation-as-object offensive (the editors of this document apparently do not) and also notes that "URI syntax" is used without a citation in Section 1.2 (perhaps there should be a citation there, but if it is there it is not needed here, and if it is not needed there then it isn't needed here either), the recommended text above can be further shortened to

New (short version):
   The IRI syntax extends the URI syntax by extending the class of unreserved characters by adding the characters (code points) of [ISO10646] outside the ASCII repertoire, subject...

In spot-checking the document further, I realized that "plain text" is actually not defined anywhere. If it is going to be used at all, I think it deserves a definition or citation. Peter's comment that most of its uses should actually be "running text" still applies.

------------------

[1] Adding to the mess, the usage of "Universal Character Set" in the Abstract is followed by a citation of Unicode _and_ ISO 10646 (note that the latter is simply wrong, even though in popular use in the IETF), one that tries to avoid the RFC Editor's "no citations in Abstracts" rule by changing the brackets to parens (something that should never fly). But it effectively leaves us with three nearly-identical definitions:

(i) "Universal Character Set" in the Abstract, defined by reference as "(Unicode/ISO 10646)",

(ii) the actual definition in Section 1.3, which is of UCS, not "Universal Character Set", defined as (behold!) "Universal Character Set", which is then defined as "ISO/IEC 10646" (correctly, not "ISO 10646") "and the Unicode Standard", and

(iii) the new inline definition in Section 2.1.

best,
   john
Received on Friday, 8 June 2012 07:31:58 UTC