Re: Updating the mailto URI scheme for better I18N

Hello Frank,

Many thanks for your comments.

At 21:57 05/02/14, Frank Ellermann wrote:
 >
 >Martin Duerst wrote:
 >
 >> The syntax of 'mailto' URIs from [RFC2368] is extended to
 >> be compatible with IRIs ([RFC3987]) for better
 >> internationalization.
 >
 >Fascinating.  I expected that it would be near to impossible
 >to fix 2368 for some decades. ;-)  And I didn't know that the
 >IANA URI-registry is incomplete, they don't have RfC 2324:
 >
 >| coffee-url  =  coffee-scheme ":" [ "//" host ]
 >|                ["/" pot-designator ] ["?" additions-list ]

Well, that includes a lot of schemes, many of them not
legal for URIs or IRIs :-).

 >> the mailto URI scheme also allows setting mail header fields
 >> and the message body.
 >
 >I'd like to add some minor points about this in 2368bis, the
 >old 2368-remark...
 >
 >| Only the Subject, Keywords, and Body headers are believed to
 >| be both safe and useful.
 >
 >...does not exactly reflect common practice.  Cc: is clearer
 >than mailto:what@ever.example?to=an@other.example constructs.

I changed the text to read "Only a limited set of headers such
as the Subject, Keywords, and Body headers are believed
to be both safe and useful in the general case."

 >Maybe add it below the "to" example for "hname":
 >
 >>     mailto:addr1%2C%20addr2
 >>     is equivalent to
 >>     mailto:?to=addr1%2C%20addr2
 >>     is equivalent to
 >>     mailto:addr1?to=addr2
 >
 >    The latter form is NOT RECOMMENDED.  If the desired effect
 >    is to specify a secondary recipient mailto:addr1?cc=addr2
 >    can be used.

Added. BTW, I wonder why it's mailto:?to=addr1%2C%20addr2 and
not mailto:?to=addr1%2Caddr2. If the latter is okay, I'd prefer that
(similarly for mailto:addr1%2C%20addr2).

Also, your solution assumes that To: and Cc: are equivalent.
They are in terms of where the message gets sent, but they
are often not in terms of what gets done with the message.
Some people are rather strict in taking messages sent with
To: as more important than those with Cc:, and that's correct
according to the original meaning of these terms.
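
To make the equivalence of the three forms concrete, here is a minimal
sketch of how a resolver might collect the To: recipients. The function
name mailto_recipients is hypothetical, and the comma splitting is
deliberately naive: it assumes no quoted local parts containing commas.

```python
from urllib.parse import unquote


def mailto_recipients(uri):
    """Collect To: addresses from the address part and any 'to' hfield."""
    if not uri.lower().startswith("mailto:"):
        raise ValueError("not a mailto URI")
    rest = uri[len("mailto:"):]
    path, _, query = rest.partition("?")

    raw_chunks = [path]
    for field in filter(None, query.split("&")):
        hname, _, hvalue = field.partition("=")
        # Header names are matched case-insensitively here (an assumption).
        if unquote(hname).lower() == "to":
            raw_chunks.append(hvalue)

    recipients = []
    for chunk in raw_chunks:
        # Percent-decode first ("%2C" is a comma), then split the list.
        for addr in unquote(chunk).split(","):
            addr = addr.strip()  # drops the optional space after the comma
            if addr:
                recipients.append(addr)
    return recipients


for uri in ("mailto:addr1%2C%20addr2",
            "mailto:?to=addr1%2C%20addr2",
            "mailto:addr1?to=addr2",
            "mailto:?to=addr1%2Caddr2"):
    print(uri, "->", mailto_recipients(uri))
```

Under these assumptions the form without %20 yields the same recipient
list, since whitespace around each address is stripped. A cc hfield is
not collected, which matches the point that To: and Cc: are kept apart.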

 >Back to your corrections:
 >
 >> A previous version of the mailto URI scheme had severe
 >> limitations for non-ASCII characters.
 >
 >That's dubious.

I can no longer find this text in the draft.

 >All you really do is to add unencoded IDNs,
 >and IDNs didn't exist when 2368 was written.  It's no "severe
 >limitation", if <a href="mailto:martin@d%C3%BCrst.example">
 >might work with future browsers, when the equivalent form
 ><a href="mailto:martin@xn--drst-0ra.example"> works today.

The limitations are not limited to domain names. There are
also limitations for 'body'. But you found that out below.

 >Sure, the IRI version needs two bytes less than the punycode
 >version in this example.

It's not about bytes. Users are used to copying and pasting, and
with RFC2368bis and IRIs, they can do that.

 >OTOH it doesn't work with old user agents.

Agreed. The draft is clear about this.
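
For illustration, a small sketch that derives both forms of the domain
from the same characters. It assumes Python's built-in "idna" codec
(which implements the RFC 3490 flavour of IDNA); the helper name
domain_forms is hypothetical.

```python
from urllib.parse import quote


def domain_forms(domain):
    """Return (IDNA form, percent-encoded UTF-8 form) of a domain name."""
    # IDNA (Punycode) form: understood by legacy mailto: interpreters.
    idna = domain.encode("idna").decode("ascii")
    # Percent-encoded UTF-8 form: encode each label, keep the dots.
    pct = ".".join(quote(label, safe="") for label in domain.split("."))
    return idna, pct


idna_form, pct_form = domain_forms("dürst.example")
print("mailto:martin@" + idna_form)  # mailto:martin@xn--drst-0ra.example
print("mailto:martin@" + pct_form)   # mailto:martin@d%C3%BCrst.example
```

A producer that wants to reach old user agents would emit the first
form; the second only works where the resolver understands
percent-encoded UTF-8 in the domain part.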


 >[Found later in your draft:  Okay, the body= UTF-8 stuff is
 > really new, and that could be seen as a "severe limitation"
 > today.  I certainly don't miss it, my MUA does not support
 > the body= feature at all.]
 >
 >> more straightforward and consistent internationalization.
 >
 >Yes, in theory.  But not yet in practice, if I publish a
 >mailto:-URL somewhere, then I want it to work for almost all
 >users today.  You explain this later in chapter 6.

Yes. This draft is forward-looking, not backward-looking.

 >>        hname       = *urlc
 >>        hvalue      = *urlc
 >
 >RfC 2368 and your draft use "urlc" without proper syntax or
 >explanation, please add something like this:
 >
 >         urlc = %d33-36 / %d38-60 / %d62 / %d64-126

Fixed, but in a different (more RFC 3986 style) way.

 >RfC 3986 apparently says nothing about "<" and ">", is this as
 >you want it ?

Yes, it doesn't say anything about "<" and ">" because these
are not allowed. RFC 2396 was a bit special in talking
about characters that are not allowed; that's not the
usual way to describe a protocol.

I'm wondering whether I should remove ";" from the list
of characters allowed in hname/hvalue, and allow it as
a separator.

Any opinions, anybody?

 >Otherwise you get %d33-36 / %d38-59 / %d64-126.
 >Plain text examples:
 >
 >mailto:no@body.example?subject=is%20<this>%20okay%3F

No.
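
As a sketch of how a producer could avoid emitting raw "<" and ">",
the snippet below percent-encodes the subject with Python's
urllib.parse.quote. The helper name mailto_with_subject is
hypothetical, and safe="" is a conservative choice: it also encodes
characters such as "!" that would not strictly need it.

```python
from urllib.parse import quote, unquote


def mailto_with_subject(addr, subject):
    """Build a mailto URI whose subject hvalue contains no raw '<' or '>'."""
    # safe="" leaves only letters, digits and "_.-~" unencoded.
    return "mailto:" + quote(addr, safe="@") + "?subject=" + quote(subject, safe="")


uri = mailto_with_subject("no@body.example", "is <this> okay?")
print(uri)  # mailto:no@body.example?subject=is%20%3Cthis%3E%20okay%3F

# Round trip: decoding the hvalue gives back the original characters.
print(unquote(uri.partition("?subject=")[2]))  # is <this> okay?
```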

 ><mailto:no@body.example?subject=%3Cthat%3E%20is%20clear!>
 >
 >BTW, please add a note about mailto-IRIs in documents, where
 >the document charset is not UTF-8.  If I got your draft right,
 >the idea is to use percent-encoded UTF-8 even if the document
 >charset is something else like Latin-1.  Example, in this
 >article I use Latin-1, but a <mailto:martin@d$B—S(Bst.example> is
 >an invalid URL, and it's also no IRI.

That got garbled by my Japanese MUA. Looking at the original
at http://lists.w3.org/Archives/Public/uri/2005Feb/0014, the
relevant character shows up correctly in my browser. But the
message was sent as US-ASCII. This means that any interpretation
of the byte in question (0xFC) is highly questionable.

If the message had correctly been labeled as iso-8859-1
(or iso-8859-2 or so), then this would have been a
perfectly fine IRI example (though of course not a
working email address). Remember, what counts for URIs
and IRIs are characters, not bytes. So, if in Latin-1,
do as in Latin-1 (as long as you can, of course).

 >A <mailto:body@check.example?body=d$B—S(Bst> is also invalid here,
 >or isn't it ?  I'm not sure about these examples, there's no
 >obvious technical problem with this body=d$B—S(Bst parameter of a
 >2368-mailto-URL in a Latin-1 article.

Yes, this is perfectly fine (except for the fact that your
message was labeled as US-ASCII). There is quite a bit of text
explaining this in the IRI spec.

 >> URI producers should provide these domain names in the IDNA
 >> encoding, rather than percent-encoded, if they wish to
 >> maximize interoperability with legacy mailto: URI
 >> interpreters
 >
 >Indeed, unfortunately you can't say SHOULD here.
 >
 >> Percent-encoding in the LHS of an email address is reserved
 >> for potential future internationalization.  Non-ASCII
 >> characters must first be encoded according to UTF-8 [STD63]
 >
 >The first statement is only correct for Non-ASCII,

Ok, added "of non-ASCII octets".

 >there's no
 >general problem with percent-encoding in the LHS of addresses
 >in mailto URLs.  The "quoted string" case of a LHS can be very
 >weird.
 >
 >> Within mailto URIs, the characters "?", "=", "&" are reserved.
 >
 >Maybe add a forward reference to chapter 5 here about NO-WS-CTL
 >and WSP.  I don't find a general rule about this issue in 3986,
 >probably I'm missing something obvious (?).

Which issue exactly?

 >> 1.  MIME encoded words (as defined in [RFC2047]) are permitted
 >>     in header values, but not in an hvalue of a "body" hname.
 >
 >That's clear.  You aren't planning to invent a mailto-IRI-body,
 >or are you ?  Oops, I found body=caf%3C%A9 later, now that's a
 >PITA,

Sorry, I think I'm acronym challenged, I don't understand PITA.

 >by using mailto-IRI-bodies the MUA is more or less forced
 >to generate a Content-Type: text/plain;charset=utf-8 with QP or
 >Base64.

Why QP or Base64? Content-Transfer-Encoding: 8bit
should be fine, shouldn't it?
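
As a rough sketch of the 8bit option, the snippet below turns a body
hvalue into a message using Python's email package. The helper name
message_from_body_hvalue is hypothetical, and whether 8bit actually
survives transport depends on the mail path, which this does not
model.

```python
from email.message import EmailMessage
from urllib.parse import unquote


def message_from_body_hvalue(hvalue):
    """Build a text/plain message from a percent-encoded UTF-8 body."""
    body = unquote(hvalue)  # %XX octets are interpreted as UTF-8
    msg = EmailMessage()
    # Ask for UTF-8 with an 8bit transfer encoding instead of QP/Base64.
    msg.set_content(body, charset="utf-8", cte="8bit")
    return msg


msg = message_from_body_hvalue("caf%C3%A9")
print(msg["Content-Type"])
print(msg["Content-Transfer-Encoding"])
```

An MUA that cannot rely on an 8bit-clean path would pass
cte="quoted-printable" or cte="base64" instead.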


 >If you really want this, please say so not only in an
 >example.  This has side effects on systems, where the default
 >local charset is _not_ Unicode (any of the UTFs).

I say that encoding is necessary with the following sentence:

"except for an hvalue of a "body" hname, which has to be
encoded according to <xref target="RFC2045"></xref>."

For further details, the readers can check that spec.
In my opinion, this is good enough. The 'local charset'
has nothing to do with this; mentioning it would only
confuse the issue.

 >> MIME encoded words and UTF-8-based percent-encoding SHOULD not
 >> both be used in the same hvalue.
 >
 >Maybe you need a MUST NOT here,


 >and definitely a NOT.

Done.

 >Examples:
 >
 >mailto:an@example?subject=%3D%3Fus-ascii%3FQ%3F1%3F%3D_2%3F%3D
 >mailto:an@example?subject=%3D%3Fus-ascii%3FQ%3FD%C3%BCrst3F%3D
 >
 >Whatever that is, it's no Subject: 1?= 2 or Subject: d$B—S(Bst. (?)

Ah, now I understand the proposal for MUST. What I meant when
saying "SHOULD not both be used in the same hvalue." was more
something like
mailto:an@example?subject=%3D%3Fiso-8859-1%3FQ%3FD%3DFCrst%3F%3D%20D%C3%BCrst
i.e. using both methods in parallel. Your examples seem to
suggest to use them on top of each other.

I have changed the wording to:
MIME encoded words and UTF-8-based percent-encoding SHOULD NOT both be used 
sequentially in the same hvalue, and MUST NOT be combined.

I hope this expresses both what I was thinking about and what
you were thinking about.

 >> The creator of a mailto URI cannot expect the resolver of a URI
 >> to understand more than the "subject" and "body" headers.
 >[...]
 >
 >Here's a place where you could explain, why clients should try
 >to support in-reply-to, and how "URI producers" should use it.
 >
 > [in the examples:]
 >> ?In-Reply-To=%3C3469A91.D10AF4C@example.com%3E>
 >
 >Here's the place, where you could say that this should be the
 >Message-ID of the mail in question.  One popular software gets
 >this wrong and apparently uses the last Message-ID found in the
 >References or in In-Reply-To to construct its mailto-URL.  That
 >confuses the threading of mail replies based on the mailto-URL.

I improved the description for the example, but I don't
want to go further. This is a spec, not implementation
criticism.

 >> Another way of expressing the same thing:
 >> <mailto:?to=joe@example.com&cc=bob@example.com&body=hello>
 >
 >Please delete this example, it's ugly.  You already have this
 >variant in the paragraph about "to" as "hname".

Done.

 >> Click <a
 >> href="mailto:?to=joe@xyz.com&amp;cc=bob@xyz.com&amp;body=hello">
 >> mailto:?to=joe@xyz.com&amp;cc=bob@xyz.com&amp;body=hello</a> to
 >> send a greeting message to Joe and Bob.
 >
 >I'd use an.example instead of xyz.com here, and replace the "to":
 >
 >  Click <a
 >  href="mailto:joe@an.example?cc=bob@a.example&amp;body=hello">
 >  mailto:joe@an.example?cc=bob@an.example&amp;body=hello</a> to
 >  send a greeting message to Joe with a copy to Bob.

Done.

 >> mailto:user@example.org?subject=%3D%3Futf-8%3FQ%3Fcaf%3DC3%3DA9%3F%3D
 >
 >Maybe replace user@example.org by an@example if your examples
 >are otherwise too long for RfC lines.

They fit exactly. But thanks for the tip.
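
To show what a resolver has to do with such a subject, here is a
minimal sketch using Python's standard library: percent-decoding first
recovers the MIME encoded word, and RFC 2047 decoding then recovers
the text. The helper name decode_subject is hypothetical, and it only
handles subjects that are plain ASCII or consist of encoded words.

```python
from email.header import decode_header
from urllib.parse import unquote


def decode_subject(uri):
    """Return (percent-decoded hvalue, RFC 2047-decoded text) for 'subject'."""
    _, _, query = uri.partition("?")
    for field in query.split("&"):
        hname, _, hvalue = field.partition("=")
        if hname.lower() == "subject":
            raw = unquote(hvalue)  # step 1: undo the percent-encoding
            text = "".join(        # step 2: undo the MIME encoded words
                chunk.decode(charset or "ascii") if isinstance(chunk, bytes) else chunk
                for chunk, charset in decode_header(raw)
            )
            return raw, text
    return None, None


uri = "mailto:user@example.org?subject=%3D%3Futf-8%3FQ%3Fcaf%3DC3%3DA9%3F%3D"
print(decode_subject(uri))  # ('=?utf-8?Q?caf=C3=A9?=', 'café')
```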

 >> The software sending the email is not restricted to UTF-8, but
 >> can use other encodings.
 >
 >It's more or less forced to stick to UTF-8 or maybe another UTF.
 >Otherwise it would have to analyze the mailto-IRI-body assuming
 >UTF-8 input.  That's a major difference from traditional mailto-
 >URLs.

I agree that using UTF-8 is much easier to implement. But
I can imagine that some MUA wants to check e.g. for Japanese-
only -> use JIS; Latin-1 only -> use iso-8859-1.
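
Such a check could look roughly like the sketch below: try a list of
charsets in order of preference and fall back to UTF-8. The function
name pick_charset and the candidate order are assumptions, and
"iso-2022-jp" stands in for "JIS" here.

```python
def pick_charset(text, candidates=("us-ascii", "iso-8859-1", "iso-2022-jp", "utf-8")):
    """Return (charset, encoded bytes) for the first charset that fits."""
    for charset in candidates:
        try:
            return charset, text.encode(charset)
        except UnicodeEncodeError:
            continue  # some character is not representable; try the next one
    raise ValueError("no candidate charset can encode the text")


for sample in ("hello", "Dürst", "日本語", "Dürst 日本語"):
    print(repr(sample), "->", pick_charset(sample)[0])
```

The input is always the character string obtained by decoding the
percent-encoded UTF-8, so the choice of outgoing charset is
independent of how the mailto URI itself was written.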

 >> The security considerations of [STD66], [RFC3490], and also
 >> apply. [RFC3987]
 >
 >s/apply. [RFC3987]/[RFC3987] apply./

Fixed (a while ago).

 >IMHO "also apply" is not good enough.  Either add some of the
 >worst examples like say illegal UTF-8 encodings and phishing, or
 >urge the readers to really check out these "external" sources.

Done (the latter).

 >Please add a note, that a plain text <URL:mailto:an@example>
 >MUST NOT use any percent encoded UTF-8, and is by definition
 >a "visible with any browser" URL, not an IRI.

I'm not sure that's necessary. There are enough examples
in the spec that should make that clear.


Regards,     Martin. 

Received on Monday, 6 March 2006 13:12:43 UTC