Re: email address in a URI from Etan Wexler on 2005-07-09 (uri@w3.org from July 2005)

From: Etan Wexler <ewexler@stickdog.com>
Date: Sat, 09 Jul 2005 15:03:48 -0400
To: Tim Kindberg <timothy@hpl.hp.com>, sandro hawke <sandro@w3.org>, URI Interest Group <uri@w3.org>
Message-ID: <42D01F94.1020005@stickdog.com>

Tim Kindberg wrote to the URI-Interest-Group list (<mailto:uri@w3.org>)
on 6 July 2005 in “email address in a URI”
(<mid:42CBAAE0.3060309@hpl.hp.com>,
<http://www.w3.org/mid/42CBAAE0.3060309@hpl.hp.com>):

> I'd appreciate comments on the following [replacement] within
> tag syntax, and the logic behind it:
> 
> emailAddress = dot-atom-text-uri "@" DNSname
> dot-atom-text-uri = 1*atext-uri * ("." 1*atext-uri)
> atext-uri= ALPHA / DIGIT / ; see RFC 2822
>             "!" / "$" /    ; only URI-compatible
>             "&" / "'" /       ; characters included
>             "*" / "+" /
>             "-" / "/" /
>             "=" / "_" /
>             "~"
> 
> The logic behind the above is:
> 0. Avoid obsolete local parts, and local parts involving CFWS
> (comments and white space, which could introduce ambiguity in who can
> use which email addresses)

The <quoted-string> form is neither listed as obsolete nor obsolete in
practice. The “tag” scheme should account for <quoted-string> local
parts.

> 3. Can't %-encode characters without ambiguity, since RFC 2822
> allows email addresses containing % HEX HEX constructs

The ambiguity is zero. Any percent signs in the local part must be
percent-encoded. Consider an e-mail address with a percent sign:

     100%BAT@mail.example

Now consider what should be a valid “tag” URI based on that address:

     tag:100%25BAT@mail.example,2005-07-09:unambiguous
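
The round trip can be checked mechanically. What follows is a minimal
sketch in Python, assuming the standard library's urllib.parse as the
percent-encoder. The function names are illustrative and come from no
specification, and the encoder escapes more characters than a “tag” URI
would strictly need.

```python
from urllib.parse import quote, unquote


def encode_local_part(local_part: str) -> str:
    """Percent-encode everything outside the URI 'unreserved' set.

    Illustrative only: the point is that a literal percent sign
    always comes out as %25.
    """
    return quote(local_part, safe="")


def decode_local_part(encoded: str) -> str:
    """Reverse the percent-encoding."""
    return unquote(encoded)


encoded = encode_local_part("100%BAT")
print(encoded)                     # 100%25BAT
print(decode_local_part(encoded))  # 100%BAT
```

Without the %25 step, a decoder would read the "%BA" in the raw address
as an encoded octet. Encoding the percent sign itself removes that
reading.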

> 4. So we have to avoid / "^" / "`" / "{" / "|" / "}"

Certainly we avoid those characters in literal form.

> 5. And it seems a bad idea to allow "#" / "%" / "?"

If they are percent-encoded, what seems a bad idea?

The outstanding issue that I detect is compatibility with the syntax of
the “mailto” scheme. The “mailto” syntax in RFC 2368 and in the pending
revision of RFC 2368 has a wider scope than the “tag” scheme’s
<emailAddress> has. The “mailto” syntax intends to capture all
variations in Internet-message addressing. (That is my reading of the
specifications, anyway.) The “mailto” scheme includes a query component
in which at least the ampersand and “equals” sign are reserved. The
“tag” specification could allow the literal appearance of those
characters in e-mail addresses if cross-scheme compatibility is not a
concern.

I suggest that the “tag” scheme avoid syntactic compatibility with the
“mailto” scheme. The result of the break will be a more readable syntax
better suited for the “tag” scheme’s narrower scope. If the “tag” scheme
sets out on its own route for e-mail addresses, the following rules are
in order.

emailAddress      = tag-local-part "@" DNSname
tag-local-part    = tag-dot-atom-text / tag-no-fold-quote
tag-dot-atom-text = 1*tag-atext *("." 1*tag-atext)
tag-atext         = ALPHA / DIGIT /
                     "!"   / "%23" /
                     "$"   / "%25" /
                     "&"   / "%27" /
                     "*"   / "+"   /
                     "-"   / "%2F" /
                     "="   / "%3F" /
                     "%5E" / "_"   /
                     "%60" / "%7B" /
                     "%7C" / "%7D" /
                     "~"
tag-no-fold-quote = "'" *tag-qtext "'"
tag-qtext         = "%01" / "%02" / "%03" / "%04" /
                     "%05" / "%06" / "%07" / "%08" /
                     "%09" /         "%0B" / "%0C" /
                             "%0E" / "%0F" / "%10" /
                     "%11" / "%12" / "%13" / "%14" /
                     "%15" / "%16" / "%17" / "%18" /
                     "%19" / "%1A" / "%1B" / "%1C" /
                     "%1D" / "%1E" / "%1F" / "%20" /
                     "!"   / "%22" / "%23" / "$"   /
                     "%25" / "&"   / "%27" / "("   /
                     ")"   / "*"   / "+"   / "%2C" /
                     "-"   / "."   / "%2F" /
                     DIGIT / "%3A" / ";"   / "%3C" /
                     "="   / "%3E" / "%3F" / "%40" /
                     ALPHA / "%5B" / "%5C" / "%5D" /
                     "%5E" / "_"   / "%60" / "%7B" /
                     "%7C" / "%7D" / "~"   / "%7F"

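For experimenting with the rules above, here is a rough transcription of
<tag-local-part> into a Python regular expression. It is a sketch under
two assumptions of mine: hexadecimal digits in percent-encodings are
upper case only (ABNF literals are case-insensitive, so a fuller version
would also accept lower case), and the name is_tag_local_part is
hypothetical.

```python
import re

# Percent-encoded members of <tag-atext>: # % ' / ? ^ ` { | }
_ATEXT_PCT = "23|25|27|2F|3F|5E|60|7B|7C|7D"
TAG_ATEXT = rf"(?:[A-Za-z0-9!$&*+\-=_~]|%(?:{_ATEXT_PCT}))"
TAG_DOT_ATOM_TEXT = rf"{TAG_ATEXT}+(?:\.{TAG_ATEXT}+)*"

# Percent-encoded members of <tag-qtext>: %01-%09, %0B, %0C, %0E-%20,
# then the punctuation that may not appear literally, then %7F.
_QTEXT_PCT = (
    "0[1-9BCEF]|1[0-9A-F]|20|22|23|25|27|2C|2F|3A|3C|3E|3F|40|"
    "5B|5C|5D|5E|60|7B|7C|7D|7F"
)
TAG_QTEXT = rf"(?:[A-Za-z0-9!$&()*+\-.;=_~]|%(?:{_QTEXT_PCT}))"
TAG_NO_FOLD_QUOTE = rf"'{TAG_QTEXT}*'"

TAG_LOCAL_PART = re.compile(rf"(?:{TAG_DOT_ATOM_TEXT}|{TAG_NO_FOLD_QUOTE})")


def is_tag_local_part(candidate: str) -> bool:
    """Return True if the whole string matches <tag-local-part>."""
    return TAG_LOCAL_PART.fullmatch(candidate) is not None


print(is_tag_local_part("100%25BAT"))             # True
print(is_tag_local_part("100%BAT"))               # False
print(is_tag_local_part("'John%20Q.%20Public'"))  # True
```
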
A <tag-local-part> construct has the following characteristics.

* Any non-obsolete local part of an e-mail address is representable.
(Note: an e-mail address as described here is not the same construct as
an RFC-2822 <addr-spec>. The latter may include comments and whitespace
which have no operational semantics. To derive an e-mail address, one
removes comments and ignorable whitespace from an <addr-spec>.)

* Characters illegal in URIs — control characters, space, characters
that delimit URIs from other text, and historically URI-unwise
characters — never appear literally. Their representations are
percent-encodings.

* Characters with a reserved meaning specified in RFC 3986 (the
characters specified by the <gen-delims> rule) never appear literally.
Their representations are percent-encodings.
   * If the colon were to appear literally, it would confuse human
consumers of “tag” URIs; the colon already has a reserved meaning within
“tag” URIs, separating the <taggingEntity> from the <specific> part.
   * If the slash were to appear literally, it would introduce a
semantics of hierarchy where none belongs.
   * If the question mark were to appear literally, it would start a
query where none belongs.
   * If the number sign were to appear literally, it would start a
fragment identifier where none belongs.
   * RFC 3986 permits the square brackets in URIs only as delimiters for
Internet-Protocol-address literals.
   * If the commercial “at” were to appear literally, it would confuse
human consumers of “tag” URIs; the commercial “at” already has a
reserved meaning within “tag” URIs, separating the <tag-local-part> from
the <DNSname>.

* Apostrophes, tentatively reserved within URIs by RFC 3986, serve the
function of the <DQUOTE> quotation mark, reserved within e-mail
addresses by RFC 2822. Thus apostrophes delimit the representation of a
string within a <tag-local-part>.

* Commas never appear literally. Their representations are
percent-encodings. If the comma were to appear literally, it would
confuse human consumers of “tag” URIs; the comma already has a reserved
meaning within “tag” URIs, separating the <authorityName> from the <date>.

* There is neither the need nor the possibility to percent-encode
<quoted-pair> constructs from RFC 2822. Instead, a <tag-no-fold-quote>
construct deals with the content represented by an RFC-2822
<quoted-string> construct, not with its representation. The
transformation algorithm is as follows.

Given input of a string conforming to the <quoted-string> rule of RFC
2822 and given output that starts as an empty string:
   1. Append an apostrophe (U+0027) to the output.
   2. Remove the delimiting <DQUOTE> quotation marks (U+0022) from the
<quoted-string>.
   3. Iterating over each remaining input character, from first to last:
    3.1. If the current character is a member of the set {U+000D, U+000A}:
     3.1.1. Skip to the next iteration.
    3.2. If the current character is a backslash (U+005C):
     3.2.1. Retrieve the next character. The character retrieved is now
the current character.
    3.3. If the current character is a member of the set {U+0021,
U+0024, U+0026, U+0028, U+0029, U+002A, U+002B, U+002D, U+002E, U+0030,
U+0031, U+0032, U+0033, U+0034, U+0035, U+0036, U+0037, U+0038, U+0039,
U+003B, U+003D, U+0041, U+0042, U+0043, U+0044, U+0045, U+0046, U+0047,
U+0048, U+0049, U+004A, U+004B, U+004C, U+004D, U+004E, U+004F, U+0050,
U+0051, U+0052, U+0053, U+0054, U+0055, U+0056, U+0057, U+0058, U+0059,
U+005A, U+005F, U+0061, U+0062, U+0063, U+0064, U+0065, U+0066, U+0067,
U+0068, U+0069, U+006A, U+006B, U+006C, U+006D, U+006E, U+006F, U+0070,
U+0071, U+0072, U+0073, U+0074, U+0075, U+0076, U+0077, U+0078, U+0079,
U+007A, U+007E}:
     3.3.1. Append the current character to the output.
    3.4. If the current character is a member of the set {U+0001,
U+0002, U+0003, U+0004, U+0005, U+0006, U+0007, U+0008, U+0009, U+000B,
U+000C, U+000E, U+000F, U+0010, U+0011, U+0012, U+0013, U+0014, U+0015,
U+0016, U+0017, U+0018, U+0019, U+001A, U+001B, U+001C, U+001D, U+001E,
U+001F, U+0020, U+0022, U+0023, U+0025, U+0027, U+002C, U+002F, U+003A,
U+003C, U+003E, U+003F, U+0040, U+005B, U+005C, U+005D, U+005E, U+0060,
U+007B, U+007C, U+007D, U+007F}:
     3.4.1. Append a percent sign (U+0025) to the output.
     3.4.2. Append to the output the two hexadecimal digits (U+0030 –
U+0039, U+0041 – U+0046) representing the possibly-zero-padded integer
that is the current character’s code point, most-significant digit first.
   4. Append an apostrophe (U+0027) to the output.
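
The steps above can be sketched in Python as follows. The function name
quoted_string_to_tag is hypothetical. Raising ValueError for input that
the algorithm does not cover (missing delimiters, a dangling backslash,
characters outside U+0001 to U+007F) is an assumption of this sketch and
not part of the algorithm.

```python
import string

# Step 3.3: characters copied to the output unchanged.
_LITERAL = set(string.ascii_letters + string.digits + "!$&()*+-.;=_~")


def quoted_string_to_tag(quoted: str) -> str:
    """Turn an RFC 2822 <quoted-string> into a <tag-no-fold-quote>."""
    if len(quoted) < 2 or not (quoted[0] == quoted[-1] == '"'):
        raise ValueError("input must be delimited by DQUOTE characters")
    out = ["'"]                           # step 1
    chars = iter(quoted[1:-1])            # step 2
    for ch in chars:                      # step 3
        if ch in "\r\n":                  # step 3.1: drop folding
            continue
        if ch == "\\":                    # step 3.2: quoted-pair
            ch = next(chars, None)
            if ch is None:
                raise ValueError("backslash with nothing after it")
        if ch in _LITERAL:                # step 3.3
            out.append(ch)
        elif 0x01 <= ord(ch) <= 0x7F and ch not in "\r\n":
            out.append(f"%{ord(ch):02X}")  # step 3.4
        else:
            raise ValueError(f"character not covered: {ch!r}")
    out.append("'")                       # step 4
    return "".join(out)


print(quoted_string_to_tag('"John Q. Public"'))  # 'John%20Q.%20Public'
print(quoted_string_to_tag(r'"a\"b"'))           # 'a%22b'
```

The second call shows the point about <quoted-pair>: the backslash
disappears, and only the character it protected is encoded.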

If the “tag” scheme preserves syntactic compatibility with the “mailto”
scheme, I believe that the following rules are in order.

emailAddress      = uri-local-part "@" DNSname
uri-local-part    = uri-dot-atom-text / uri-no-fold-quote
uri-dot-atom-text = 1*uri-atext *("." 1*uri-atext)
uri-atext         = ALPHA / DIGIT /
                     "%21" / "%23" /
                     "%24" / "%25" /
                     "%26" / "%27" /
                     "%2A" / "%2B" /
                     "-"   / "%2F" /
                     "%3D" / "%3F" /
                     "%5E" / "_"   /
                     "%60" / "%7B" /
                     "%7C" / "%7D" /
                     "~"
uri-no-fold-quote = "%22" *uri-qtext "%22"
uri-qtext         = "%01" / "%02" / "%03" / "%04" /
                     "%05" / "%06" / "%07" / "%08" /
                     "%09" /         "%0B" / "%0C" /
                             "%0E" / "%0F" / "%10" /
                     "%11" / "%12" / "%13" / "%14" /
                     "%15" / "%16" / "%17" / "%18" /
                     "%19" / "%1A" / "%1B" / "%1C" /
                     "%1D" / "%1E" / "%1F" / "%20" /
                     "%21" /         "%23" / "%24" /
                     "%25" / "%26" / "%27" / "%28" /
                     "%29" / "%2A" / "%2B" / "%2C" /
                     "-"   / "."   / "%2F" /
                     DIGIT / "%3A" / "%3B" / "%3C" /
                     "%3D" / "%3E" / "%3F" / "%40" /
                     ALPHA / "%5B"         / "%5D" /
                     "%5E" / "_"   / "%60" / "%7B" /
                     "%7C" / "%7D" / "~"   / "%7F" /
                     "%5C%22" / "%5C%5C"

In the usual cases (in which the only punctuation comprises the
hyphen-minus, period, and underscore) the choice between the syntaxes
has no effect. It is when an unusual e-mail address comes into play that
the choice determines legibility:

E-mail address: "No+spam!_(Get_it?)"@d.example
“tag” <emailAddress>: 'No+spam!_(Get_it%3F)'@d.example
“mailto” URI: mailto:%22No%2Bspam%21_%28Get_it%3F%29%22@d.example
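
The “mailto”-compatible form of the example can be reproduced with a
short Python sketch. This message gives no algorithm for the
<uri-no-fold-quote> syntax, so the function below is only one reading of
the grammar: it takes the content of the quoted string (delimiters and
quoting backslashes already removed), restores a backslash before each
quotation mark and backslash, and percent-encodes everything except
letters, digits, "-", ".", "_", and "~". The name content_to_uri_quote is
hypothetical.

```python
import string

# Characters that <uri-qtext> allows in literal form.
_URI_QTEXT_LITERAL = set(string.ascii_letters + string.digits + "-._~")


def content_to_uri_quote(content: str) -> str:
    """Encode quoted-string content as a <uri-no-fold-quote>."""
    out = ["%22"]                              # opening DQUOTE
    for ch in content:
        if ch in _URI_QTEXT_LITERAL:
            out.append(ch)
        elif ch in '"\\':
            # DQUOTE and backslash appear only as %5C%22 and %5C%5C.
            out.append(f"%5C%{ord(ch):02X}")
        elif 0x01 <= ord(ch) <= 0x7F and ch not in "\r\n":
            out.append(f"%{ord(ch):02X}")
        else:
            raise ValueError(f"character not covered: {ch!r}")
    out.append("%22")                          # closing DQUOTE
    return "".join(out)


print("mailto:" + content_to_uri_quote("No+spam!_(Get_it?)") + "@d.example")
# mailto:%22No%2Bspam%21_%28Get_it%3F%29%22@d.example
```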

-- 
Etan Wexler.
Received on Saturday, 9 July 2005 19:01:02 UTC