- From: Etan Wexler <ewexler@stickdog.com>
- Date: Sat, 09 Jul 2005 15:03:48 -0400
- To: Tim Kindberg <timothy@hpl.hp.com>, sandro hawke <sandro@w3.org>, URI Interest Group <uri@w3.org>
Tim Kindberg wrote to the URI-Interest-Group list (<mailto:uri@w3.org>)
on 6 July 2005 in “email address in a URI”
(<mid:42CBAAE0.3060309@hpl.hp.com>,
<http://www.w3.org/mid/42CBAAE0.3060309@hpl.hp.com>):
> I'd appreciate comments on the following [replacement] within
> tag syntax, and the logic behind it:
>
> emailAddress = dot-atom-text-uri "@" DNSname
> dot-atom-text-uri = 1*atext-uri * ("." 1*atext-uri)
> atext-uri= ALPHA / DIGIT / ; see RFC 2822
> "!" / "$" / ; only URI-compatible
> "&" / "'" / ; characters included
> "*" / "+" /
> "-" / "/" /
> "=" / "_" /
> "~"
>
> The logic behind the above is:
> 0. Avoid obsolete local parts, and local parts involving CFWS
> (comments and white space, which could introduce ambiguity in who can
> use which email addresses)
The <quoted-string> form is neither listed as obsolete in RFC 2822 nor
obsolete in practice. The “tag” scheme should account for <quoted-string>
local parts.
> 3. Can't %-encode characters without ambiguity, since RFC 2822
> allows email addresses containing % HEX HEX constructs
There is no ambiguity. Any percent sign in the local part must itself be
percent-encoded. Consider an e-mail address with a percent sign:
    100%BAT@mail.example
Now consider what should be a valid “tag” URI based on that address:
    tag:100%25BAT@mail.example,2005-07-09:unambiguous
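The round trip can be sketched in Python. This is only an illustration of the percent-sign case: `quote` and `unquote` are the standard library's general-purpose functions, not an implementation of any “tag” grammar, and the variable names are made up for the example.

```python
from urllib.parse import quote, unquote

local_part = "100%BAT"

# safe="" makes quote() escape everything outside the unreserved set,
# so the literal percent sign becomes "%25".
encoded = quote(local_part, safe="")
tag_uri = f"tag:{encoded}@mail.example,2005-07-09:unambiguous"

# unquote() decodes in a single pass: "%25" becomes "%", and the
# resulting "%BA" is not decoded a second time.
decoded = unquote(encoded)
```

Because decoding is a single pass, the "%BA" sequence in the original address survives intact.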
> 4. So we have to avoid / "^" / "`" / "{" / "|" / "}"
Certainly we avoid those characters in literal form.
> 5. And it seems a bad idea to allow "#" / "%" / "?"
If they are percent-encoded, what about them seems a bad idea?
The outstanding issue that I detect is compatibility with the syntax of
the “mailto” scheme. The “mailto” syntax in RFC 2368 and in the pending
revision of RFC 2368 has a wider scope than the “tag” scheme’s
<emailAddress> has. The “mailto” syntax intends to capture all
variations in Internet-message addressing. (That is my reading of the
specifications, anyway.) The “mailto” scheme includes a query component
in which at least the ampersand and “equals” sign are reserved. The
“tag” specification could allow the literal appearance of those
characters in e-mail addresses if cross-scheme compatibility is not a
concern.
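A small sketch of why the ampersand matters in a “mailto” query. It assumes that Python's generic query parser is a fair stand-in for a “mailto” consumer, which is an assumption and not something RFC 2368 prescribes. The helper name `mailto_headers` and the address `r&d@d.example` are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def mailto_headers(uri: str) -> dict:
    """Split the query component of a mailto URI into header fields."""
    return parse_qs(urlsplit(uri).query)


# Ampersand in the address percent-encoded: the "to" value survives.
encoded = mailto_headers("mailto:?to=r%26d@d.example&subject=Hello")

# Ampersand left literal: it is read as a field separator, so the
# address is cut short after "r".
literal = mailto_headers("mailto:?to=r&d@d.example&subject=Hello")
```

Under this parser, `encoded["to"]` holds the whole address while `literal["to"]` holds only the part before the ampersand.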
I suggest that the “tag” scheme avoid syntactic compatibility with the
“mailto” scheme. The result of the break will be a more readable syntax,
better suited to the “tag” scheme’s narrower scope. If the “tag” scheme
sets out on its own route for e-mail addresses, the following rules are in
order.
    emailAddress      = tag-local-part "@" DNSname
    tag-local-part    = tag-dot-atom-text / tag-no-fold-quote
    tag-dot-atom-text = 1*tag-atext *("." 1*tag-atext)
    tag-atext         = ALPHA / DIGIT /
                        "!" / "%23" /
                        "$" / "%25" /
                        "&" / "%27" /
                        "*" / "+" /
                        "-" / "%2F" /
                        "=" / "%3F" /
                        "%5E" / "_" /
                        "%60" / "%7B" /
                        "%7C" / "%7D" /
                        "~"
    tag-no-fold-quote = "'" *tag-qtext "'"
    tag-qtext         = "%01" / "%02" / "%03" / "%04" /
                        "%05" / "%06" / "%07" / "%08" /
                        "%09" / "%0B" / "%0C" /
                        "%0E" / "%0F" / "%10" /
                        "%11" / "%12" / "%13" / "%14" /
                        "%15" / "%16" / "%17" / "%18" /
                        "%19" / "%1A" / "%1B" / "%1C" /
                        "%1D" / "%1E" / "%1F" / "%20" /
                        "!" / "%22" / "%23" / "$" /
                        "%25" / "&" / "%27" / "(" /
                        ")" / "*" / "+" / "%2C" /
                        "-" / "." / "%2F" /
                        DIGIT / "%3A" / ";" / "%3C" /
                        "=" / "%3E" / "%3F" / "%40" /
                        ALPHA / "%5B" / "%5C" / "%5D" /
                        "%5E" / "_" / "%60" / "%7B" /
                        "%7C" / "%7D" / "~" / "%7F"
A <tag-local-part> construct has the following characteristics.
* Any non-obsolete local part of an e-mail address is representable.
(Note: an e-mail address as described here is not the same construct as
an RFC-2822 <addr-spec>. The latter may include comments and whitespace
which have no operational semantics. To derive an e-mail address, one
removes comments and ignorable whitespace from an <addr-spec>.)
* Characters illegal in URIs — control characters, space, characters
that delimit URIs from other text, and historically URI-unwise
characters — never appear literally. Their representations are
percent-encodings.
* Characters with a reserved meaning specified in RFC 3986 (the
characters specified by the <gen-delims> rule) never appear literally.
Their representations are percent-encodings.
  * If the colon were to appear literally, it would confuse human
  consumers of “tag” URIs; the colon already has a reserved meaning within
  “tag” URIs, separating the <taggingEntity> from the <specific> part.
  * If the slash were to appear literally, it would introduce a
  semantics of hierarchy where none belongs.
  * If the question mark were to appear literally, it would start a
  query where none belongs.
  * If the number sign were to appear literally, it would start a
  fragment identifier where none belongs.
  * RFC 3986 permits the square brackets in URIs only as delimiters for
  Internet-Protocol-address literals.
  * If the commercial “at” were to appear literally, it would confuse
  human consumers of “tag” URIs; the commercial “at” already has a
  reserved meaning within “tag” URIs, separating the <tag-local-part> from
  the <DNSname>.
* Apostrophes, tentatively reserved within URIs by RFC 3986, serve the
function of the <DQUOTE> quotation mark, reserved within e-mail
addresses by RFC 2822. Thus apostrophes delimit the representation of a
string within a <tag-local-part>.
* Commas never appear literally. Their representations are
percent-encodings. If the comma were to appear literally, it would
confuse human consumers of “tag” URIs; the comma already has a reserved
meaning within “tag” URIs, separating the <authorityName> from the <date>.
* There is neither the necessity nor the possibility of percent-encoding
<quoted-pair> constructs from RFC 2822. Instead, a <tag-no-fold-quote>
construct represents the content of an RFC-2822 <quoted-string>
construct, not its escaped representation. The transformation algorithm
is as follows.
Given input of a string conforming to the <quoted-string> rule of RFC
2822 and given output that starts as an empty string:
1. Append an apostrophe (U+0027) to the output.
2. Remove the delimiting <DQUOTE> quotation marks (U+0022) from the
<quoted-string>.
3. Iterating over each remaining input character, from first to last:
3.1. If the current character is a member of the set {U+000D, U+000A}:
3.1.1. Skip to the next iteration.
3.2. If the current character is a backslash (U+005C):
3.2.1. Retrieve the next character. The character retrieved is now
the current character.
3.3. If the current character is a member of the set {U+0021,
U+0024, U+0026, U+0028, U+0029, U+002A, U+002B, U+002D, U+002E, U+0030,
U+0031, U+0032, U+0033, U+0034, U+0035, U+0036, U+0037, U+0038, U+0039,
U+003B, U+003D, U+0041, U+0042, U+0043, U+0044, U+0045, U+0046, U+0047,
U+0048, U+0049, U+004A, U+004B, U+004C, U+004D, U+004E, U+004F, U+0050,
U+0051, U+0052, U+0053, U+0054, U+0055, U+0056, U+0057, U+0058, U+0059,
U+005A, U+005F, U+0061, U+0062, U+0063, U+0064, U+0065, U+0066, U+0067,
U+0068, U+0069, U+006A, U+006B, U+006C, U+006D, U+006E, U+006F, U+0070,
U+0071, U+0072, U+0073, U+0074, U+0075, U+0076, U+0077, U+0078, U+0079,
U+007A, U+007E}:
3.3.1. Append the current character to the output.
3.4. If the current character is a member of the set {U+0001,
U+0002, U+0003, U+0004, U+0005, U+0006, U+0007, U+0008, U+0009, U+000B,
U+000C, U+000E, U+000F, U+0010, U+0011, U+0012, U+0013, U+0014, U+0015,
U+0016, U+0017, U+0018, U+0019, U+001A, U+001B, U+001C, U+001D, U+001E,
U+001F, U+0020, U+0022, U+0023, U+0025, U+0027, U+002C, U+002F, U+003A,
U+003C, U+003E, U+003F, U+0040, U+005B, U+005C, U+005D, U+005E, U+0060,
U+007B, U+007C, U+007D, U+007F}:
3.4.1. Append a percent sign (U+0025) to the output.
3.4.2. Append to the output the two hexadecimal digits (U+0030 –
U+0039, U+0041 – U+0046) representing the possibly-zero-padded integer
that is the current character’s code point, most-significant digit first.
4. Append an apostrophe (U+0027) to the output.
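The steps above can be sketched in Python as follows. This is a minimal sketch, not a reference implementation. The function name `quoted_string_to_tag` is hypothetical, the regular expression is a hand transcription of the <tag-no-fold-quote> rule used only to check the output, and the handling of characters that the steps do not mention (rejecting them) is an assumption.

```python
import re
import string

# Step 3.3: characters copied to the output unchanged.
LITERAL = set(string.ascii_letters + string.digits + "!$&()*+-.;=_~")
# Step 3.4: every other character from U+0001 to U+007F except CR and LF.
ENCODED = {chr(c) for c in range(0x01, 0x80)} - LITERAL - {"\r", "\n"}

# Hand transcription of <tag-no-fold-quote>, used only to check the output.
# Assumption: only uppercase hexadecimal digits are accepted.
TAG_QTEXT = (
    r"(?:[A-Za-z0-9!$&()*+\-.;=_~]"
    r"|%(?:0[1-9BCEF]|1[0-9A-F]|2[02357CF]|3[ACEF]|40|5[B-E]|60|7[BCDF]))"
)
TAG_NO_FOLD_QUOTE = re.compile(rf"'{TAG_QTEXT}*'")


def quoted_string_to_tag(quoted: str) -> str:
    """Turn an RFC 2822 <quoted-string> into a <tag-no-fold-quote>."""
    if len(quoted) < 2 or not (quoted[0] == quoted[-1] == '"'):
        raise ValueError("input must be delimited by DQUOTE characters")
    body = quoted[1:-1]          # step 2: drop the delimiting quotation marks
    out = ["'"]                  # step 1: opening apostrophe
    i = 0
    while i < len(body):         # step 3: walk the remaining characters
        ch = body[i]
        i += 1
        if ch in "\r\n":         # step 3.1: skip CR and LF
            continue
        if ch == "\\":           # step 3.2: quoted-pair, take the next character
            if i >= len(body):
                raise ValueError("backslash at end of input")
            ch = body[i]
            i += 1
        if ch in LITERAL:        # step 3.3: copy through
            out.append(ch)
        elif ch in ENCODED:      # step 3.4: "%" plus two uppercase hex digits
            out.append(f"%{ord(ch):02X}")
        else:
            # Assumption: the steps are silent here, so the sketch rejects.
            raise ValueError(f"U+{ord(ch):04X} is not covered by the steps")
    out.append("'")              # step 4: closing apostrophe
    return "".join(out)


result = quoted_string_to_tag('"John Doe"')   # 'John%20Doe' with apostrophes
conforms = TAG_NO_FOLD_QUOTE.fullmatch(result) is not None
```

Working on the content of the quoted string rather than on its escaped form is what lets a <quoted-pair> such as a backslash followed by a quotation mark come out as a plain "%22".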
If the “tag” scheme preserves syntactic compatibility with the “mailto”
scheme, I believe that the following rules are in order.
    emailAddress      = uri-local-part "@" DNSname
    uri-local-part    = uri-dot-atom-text / uri-no-fold-quote
    uri-dot-atom-text = 1*uri-atext *("." 1*uri-atext)
    uri-atext         = ALPHA / DIGIT /
                        "%21" / "%23" /
                        "%24" / "%25" /
                        "%26" / "%27" /
                        "%2A" / "%2B" /
                        "-" / "%2F" /
                        "%3D" / "%3F" /
                        "%5E" / "_" /
                        "%60" / "%7B" /
                        "%7C" / "%7D" /
                        "~"
    uri-no-fold-quote = "%22" *uri-qtext "%22"
    uri-qtext         = "%01" / "%02" / "%03" / "%04" /
                        "%05" / "%06" / "%07" / "%08" /
                        "%09" / "%0B" / "%0C" /
                        "%0E" / "%0F" / "%10" /
                        "%11" / "%12" / "%13" / "%14" /
                        "%15" / "%16" / "%17" / "%18" /
                        "%19" / "%1A" / "%1B" / "%1C" /
                        "%1D" / "%1E" / "%1F" / "%20" /
                        "%21" / "%23" / "%24" /
                        "%25" / "%26" / "%27" / "%28" /
                        "%29" / "%2A" / "%2B" / "%2C" /
                        "-" / "." / "%2F" /
                        DIGIT / "%3A" / "%3B" / "%3C" /
                        "%3D" / "%3E" / "%3F" / "%40" /
                        ALPHA / "%5B" / "%5D" /
                        "%5E" / "_" / "%60" / "%7B" /
                        "%7C" / "%7D" / "~" / "%7F" /
                        "%5C%22" / "%5C%5C"
In the usual cases (in which the only punctuation comprises the
hyphen-minus, period, and underscore) the choice between the syntaxes
has no effect. It is when an unusual e-mail address comes into play that
the choice determines legibility:
    E-mail address: "No+spam!_(Get_it?)"@d.example
    “tag” <emailAddress>: 'No+spam!_(Get_it%3F)'@d.example
    “mailto” URI: mailto:%22No%2Bspam%21_%28Get_it%3F%29%22@d.example
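For the “mailto”-compatible alternative, the <uri-no-fold-quote> rule can be sketched in Python along these lines. The function name `to_uri_no_fold_quote` is hypothetical, and the sketch assumes that its input is the already-unescaped content of the quoted string (delimiting quotation marks and the backslashes of any <quoted-pair> already removed).

```python
import string

# Characters that <uri-qtext> allows literally: ALPHA, DIGIT, "-", ".", "_", "~".
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def to_uri_no_fold_quote(content: str) -> str:
    """Encode unescaped quoted-string content as a <uri-no-fold-quote>."""
    out = ["%22"]                            # opening DQUOTE, percent-encoded
    for ch in content:
        code = ord(ch)
        if ch in UNRESERVED:
            out.append(ch)
        elif ch in '"\\':
            # The grammar admits these only as "%5C%22" and "%5C%5C",
            # that is, as percent-encoded quoted-pairs.
            out.append(f"%5C%{code:02X}")
        elif 0x01 <= code <= 0x7F and ch not in "\r\n":
            out.append(f"%{code:02X}")
        else:
            # Assumption: anything outside the grammar is rejected.
            raise ValueError(f"U+{code:04X} is not representable")
    out.append("%22")                        # closing DQUOTE, percent-encoded
    return "".join(out)


local_part = to_uri_no_fold_quote("No+spam!_(Get_it?)")
mailto_uri = f"mailto:{local_part}@d.example"
```

With the content of the unusual address above as input, `mailto_uri` comes out as the “mailto” URI shown in the comparison.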
--
Etan Wexler.
Received on Saturday, 9 July 2005 19:01:02 UTC