Re: rfc3987bis and RFC 6365 from Martin J. Dürst on 2012-10-20 (public-iri@w3.org from October 2012)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Sat, 20 Oct 2012 16:39:48 +0900
To: Peter Saint-Andre <stpeter@stpeter.im>
CC: public-iri@w3.org
Message-ID: <50825544.5060504@it.aoyama.ac.jp>

Hello Peter, others,

On 2012/06/08 3:42, Peter Saint-Andre wrote:
> <hat type='individual'/>
>
> At IETF 84, we discussed the desirability of aligning the terminology in
> 3987bis with RFC 6365 ("Terminology Used in Internationalization in the
> IETF"). This is ticket #85 in the tracker:
>
> http://trac.tools.ietf.org/wg/iri/trac/ticket/85
>
> I've completed a review of both documents and have a few suggestions...
>
> 1. In Section 1.3, cite RFC 6365 and specify that terms are to be
> understood as defined in that document unless otherwise specified (in
> fact, now that we have RFC 6365 it's not clear why we're citing RFC
> 2130, RFC 2277, or ISO 10646). I suggest:
>
> OLD
>     The following definitions are used in this document; they follow the
>     terms in [RFC2130], [RFC2277], and [ISO10646].
>
> NEW
>     Various terms used in this document are defined in [RFC6365] and
>     [RFC3986].  In addition, we define the following terms for use in
>     this document.

Implemented in my editorial copy. Many thanks for the actual text proposal.

> 2. Don't define anew in rfc3987bis terms that are defined in RFC 6365.
> That would mean removing the following definitions from Section 1.3:
>
> - character
> - character repertoire

Done.

> - character encoding (use "character encoding scheme" or "character
> encoding form" instead)
> - charset

These two are not that simple. For background, please check
http://www.w3.org/TR/charmod/#sec-Digital.

Here is what we currently have for "character encoding":
     A method of representing a sequence
     of characters as a sequence of octets (maybe with variants). Also,
     a method of (unambiguously) converting a sequence of octets into a
     sequence of characters.

The problem with 'charset' as defined in RFC 6365 (and elsewhere) is 
that it's purely one-way, from octets to characters. But there's the 
other direction, too.
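
To illustrate the two directions, here is a minimal Python sketch (not part of the draft text). The sample string is an arbitrary choice, and UTF-8 stands in for any character encoding:

```python
# Illustration only: a "character encoding" in the sense above covers
# both directions, not just octets -> characters.
text = "Dürst"  # arbitrary sample string

# Direction 1: sequence of characters -> sequence of octets
octets = text.encode("utf-8")

# Direction 2: sequence of octets -> sequence of characters
roundtrip = octets.decode("utf-8")

print(octets)
print(roundtrip == text)
```

A definition that only covers decoding describes the second step but says nothing about the first.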

The problem with "character encoding scheme" or "character encoding 
form" is that they are much more specialized terms.

RFC 6365 has this to say after the definition of "charset":

       Many protocol definitions use the term "character set" in their
       descriptions.  The terms "charset", or "character encoding scheme"
       and "coded character set", are strongly preferred over the term
       "character set" because "character set" has other definitions in
       other contexts, particularly outside the IETF.  When reading IETF
       standards that use "character set" without defining the term, they
       usually mean "a specific combination of one CCS with a CES",
       particularly when they are talking about the "US-ASCII character
       set".

Of course, per http://www.w3.org/MarkUp/html-spec/charset-harmful 
and as above, we sure don't want to use "character set". And we indeed 
want something to denote "a specific combination of one CCS with a CES" 
(or in some cases actually a combination of more than one CCS...), so 
neither "coded character set" (CCS) nor "character encoding scheme" 
(CES) will do, despite the suggestions above. So we just ended up with 
"character encoding", using a simple term for a very central concept, 
also in line with http://www.w3.org/TR/charmod/.
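
As a rough sketch of why neither CCS nor CES alone is enough (Python, illustration only; the sample character is an arbitrary choice): the CCS assigns an integer to a character, and a CES then serializes that integer as octets. What we call a "character encoding" is the combination of the two.

```python
# Illustration only, using one arbitrary sample character.
ch = "ü"

# CCS step: the coded character set (here Unicode/ISO 10646) maps the
# character to an integer code point.
code_point = ord(ch)

# CES step: different encoding schemes serialize the same code point
# as different octet sequences.
utf8_octets = ch.encode("utf-8")
utf16be_octets = ch.encode("utf-16-be")

print(hex(code_point), utf8_octets, utf16be_octets)
```

The code point is the same in both cases; only the octets differ, which is why the term we need has to name the whole mapping from characters to octets.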

As a result of this, we only use "charset" when it's used as a label, 
with a narrowed definition: "The name of a parameter or attribute used 
to identify a character encoding."
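
A small sketch of "charset" in this narrowed sense, as a label that merely identifies a character encoding (Python standard library; the header value and body octets are made-up examples):

```python
from email.message import Message

# Hypothetical message header; "charset" is just the parameter name.
msg = Message()
msg["Content-Type"] = 'text/plain; charset="ISO-8859-1"'

# The label identifies the character encoding (returned lower-cased).
label = msg.get_content_charset()

# The character encoding itself is what maps octets to characters.
body_octets = b"D\xfcrst"  # made-up body
body_text = body_octets.decode(label)

print(label, body_text)
```

Here "charset" only names the parameter; the conversion is done by the character encoding that the label points to.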

I guess we could just drop the narrowing definition of "charset", but we 
can't drop "character encoding".


> 3. Do we really need to define "octet", "sequence of characters", and
> "sequence of octets"?

Good questions. RFC 6365 uses "octet" without defining it, so I guess we 
can drop it. I think we can also drop "sequence of characters" and 
"sequence of octets", but I'd like to get Larry's okay for these.

> 4. Strangely, RFC 6365 does not define "UCS", so I suppose it's OK to
> define that here.

Following discussions later in this thread, I'm trying to get rid of 
this. But it needs some more thought.


Regards,   Martin.
Received on Saturday, 20 October 2012 07:40:28 UTC