- From: Etan Wexler <ewexler@stickdog.com>
- Date: Sat, 09 Jul 2005 15:03:48 -0400
- To: Tim Kindberg <timothy@hpl.hp.com>, sandro hawke <sandro@w3.org>, URI Interest Group <uri@w3.org>
Tim Kindberg wrote to the URI-Interest-Group list (<mailto:uri@w3.org>) on 6 July 2005 in “email address in a URI” (<mid:42CBAAE0.3060309@hpl.hp.com>, <http://www.w3.org/mid/42CBAAE0.3060309@hpl.hp.com>):

> I'd appreciate comments on the following [replacement] within
> tag syntax, and the logic behind it:
>
> emailAddress = dot-atom-text-uri "@" DNSname
> dot-atom-text-uri = 1*atext-uri * ("." 1*atext-uri)
> atext-uri= ALPHA / DIGIT / ; see RFC 2822
>            "!" / "$" /     ; only URI-compatible
>            "&" / "'" /     ; characters included
>            "*" / "+" /
>            "-" / "/" /
>            "=" / "_" /
>            "~"
>
> The logic behind the above is:
> 0. Avoid obsolete local parts, and local parts involving CFWS
> (comments and white space, which could introduce ambiguity in who can
> use which email addresses)

The form <quoted-string> is neither listed as obsolete nor is obsolete in practice. The “tag” scheme should account for <quoted-string> local parts.

> 3. Can't %-encode characters without ambiguity, since RFC 2822
> allows email addresses containing % HEX HEX constructs

The ambiguity is zero. Any percent signs in the local part must be percent-encoded. Consider an e-mail address with a percent sign:

    100%BAT@mail.example

Now consider what should be a valid “tag” URI based on that address:

    tag:100%25BAT@mail.example,2005-07-09:unambiguous

> 4. So we have to avoid / "^" / "`" / "{" / "|" / "}"

Certainly we avoid those characters in literal form.

> 5. And it seems a bad idea to allow "#" / "%" / "?"

If they are percent-encoded, what seems a bad idea?

The outstanding issue that I detect is compatibility with the syntax of the “mailto” scheme. The “mailto” syntax in RFC 2368 and in the pending revision of RFC 2368 has a wider scope than the “tag” scheme’s <emailAddress> has. The “mailto” syntax intends to capture all variations in Internet-message addressing. (That is my reading of the specifications, anyway.)
The “mailto” scheme includes a query component in which at least the ampersand and “equals” sign are reserved. The “tag” specification could allow the literal appearance of those characters in e-mail addresses if cross-scheme compatibility is not a concern.

I suggest that the “tag” scheme avoid syntactic compatibility with the “mailto” scheme. The result of the break will be a more-readable syntax better suited for the “tag” scheme’s narrower scope.

If the “tag” scheme sets out on its own route for e-mail addresses, the following rules are in order.

emailAddress      = tag-local-part "@" DNSname
tag-local-part    = tag-dot-atom-text / tag-no-fold-quote
tag-dot-atom-text = 1*tag-atext *("." 1*tag-atext)
tag-atext         = ALPHA / DIGIT /
                    "!" / "%23" / "$" / "%25" / "&" / "%27" / "*" / "+" /
                    "-" / "%2F" / "=" / "%3F" / "%5E" / "_" / "%60" /
                    "%7B" / "%7C" / "%7D" / "~"
tag-no-fold-quote = "'" *tag-qtext "'"
tag-qtext         = "%01" / "%02" / "%03" / "%04" / "%05" / "%06" / "%07" /
                    "%08" / "%09" / "%0B" / "%0C" / "%0E" / "%0F" / "%10" /
                    "%11" / "%12" / "%13" / "%14" / "%15" / "%16" / "%17" /
                    "%18" / "%19" / "%1A" / "%1B" / "%1C" / "%1D" / "%1E" /
                    "%1F" / "%20" / "!" / "%22" / "%23" / "$" / "%25" / "&" /
                    "%27" / "(" / ")" / "*" / "+" / "%2C" / "-" / "." /
                    "%2F" / DIGIT / "%3A" / ";" / "%3C" / "=" / "%3E" /
                    "%3F" / "%40" / ALPHA / "%5B" / "%5C" / "%5D" / "%5E" /
                    "_" / "%60" / "%7B" / "%7C" / "%7D" / "~" / "%7F"

A <tag-local-part> construct has the following characteristics.

* Any non-obsolete local part of an e-mail address is representable. (Note: an e-mail address as described here is not the same construct as an RFC-2822 <addr-spec>. The latter may include comments and whitespace which have no operational semantics. To derive an e-mail address, one removes comments and ignorable whitespace from an <addr-spec>.)

* Characters illegal in URIs (control characters, space, characters that delimit URIs from other text, and historically URI-unwise characters) never appear literally.
  Their representations are percent-encodings.

* Characters with a reserved meaning specified in RFC 3986 (the characters specified by the <gen-delims> rule) never appear literally. Their representations are percent-encodings.

  * If the colon were to appear literally, it would confuse human consumers of “tag” URIs; the colon already has a reserved meaning within “tag” URIs, separating the <taggingEntity> from the <specific> part.
  * If the slash were to appear literally, it would introduce a semantics of hierarchy where none belongs.
  * If the question mark were to appear literally, it would start a query where none belongs.
  * If the number sign were to appear literally, it would start a fragment identifier where none belongs.
  * RFC 3986 permits the square brackets in URIs only as delimiters for Internet-Protocol-address literals.
  * If the commercial “at” were to appear literally, it would confuse human consumers of “tag” URIs; the commercial “at” already has a reserved meaning within “tag” URIs, separating the <tag-local-part> from the <DNSname>.

* Apostrophes, tentatively reserved within URIs by RFC 3986, serve the function of the <DQUOTE> quotation mark, reserved within e-mail addresses by RFC 2822. Thus apostrophes delimit the representation of a string within a <tag-local-part>.

* Commas never appear literally. Their representations are percent-encodings. If the comma were to appear literally, it would confuse human consumers of “tag” URIs; the comma already has a reserved meaning within “tag” URIs, separating the <authorityName> from the <date>.

* There is neither the necessity nor the possibility of percent-escaping <quoted-pair> constructs from RFC 2822. Instead, a <tag-no-fold-quote> construct deals with the content represented by an RFC-2822 <quoted-string> construct, not with the representation.

The transformation algorithm is as follows.
Given input of a string conforming to the <quoted-string> rule of RFC 2822 and given output that starts as an empty string:

1. Append an apostrophe (U+0027) to the output.
2. Remove the delimiting <DQUOTE> quotation marks (U+0022) from the <quoted-string>.
3. Iterating over each remaining input character, from first to last:
   3.1. If the current character is a member of the set {U+000D, U+000A}:
        3.1.1. Skip to the next iteration.
   3.2. If the current character is a backslash (U+005C):
        3.2.1. Retrieve the next character. The character retrieved is now the current character.
   3.3. If the current character is a member of the set
        {U+0021, U+0024, U+0026, U+0028, U+0029, U+002A, U+002B, U+002D,
        U+002E, U+0030, U+0031, U+0032, U+0033, U+0034, U+0035, U+0036,
        U+0037, U+0038, U+0039, U+003B, U+003D, U+0041, U+0042, U+0043,
        U+0044, U+0045, U+0046, U+0047, U+0048, U+0049, U+004A, U+004B,
        U+004C, U+004D, U+004E, U+004F, U+0050, U+0051, U+0052, U+0053,
        U+0054, U+0055, U+0056, U+0057, U+0058, U+0059, U+005A, U+005F,
        U+0061, U+0062, U+0063, U+0064, U+0065, U+0066, U+0067, U+0068,
        U+0069, U+006A, U+006B, U+006C, U+006D, U+006E, U+006F, U+0070,
        U+0071, U+0072, U+0073, U+0074, U+0075, U+0076, U+0077, U+0078,
        U+0079, U+007A, U+007E}:
        3.3.1. Append the current character to the output.
   3.4. If the current character is a member of the set
        {U+0001, U+0002, U+0003, U+0004, U+0005, U+0006, U+0007, U+0008,
        U+0009, U+000B, U+000C, U+000E, U+000F, U+0010, U+0011, U+0012,
        U+0013, U+0014, U+0015, U+0016, U+0017, U+0018, U+0019, U+001A,
        U+001B, U+001C, U+001D, U+001E, U+001F, U+0020, U+0022, U+0023,
        U+0025, U+0027, U+002C, U+002F, U+003A, U+003C, U+003E, U+003F,
        U+0040, U+005B, U+005C, U+005D, U+005E, U+0060, U+007B, U+007C,
        U+007D, U+007F}:
        3.4.1. Append a percent sign (U+0025) to the output.
        3.4.2. Append to the output the two hexadecimal digits (U+0030 – U+0039, U+0041 – U+0046) representing the possibly-zero-padded integer that is the current character’s code point, most-significant digit first.
4. Append an apostrophe (U+0027) to the output.

If the “tag” scheme preserves syntactic compatibility with the “mailto” scheme, I believe that the following rules are in order.

emailAddress      = uri-local-part "@" DNSname
uri-local-part    = uri-dot-atom-text / uri-no-fold-quote
uri-dot-atom-text = 1*uri-atext *("." 1*uri-atext)
uri-atext         = ALPHA / DIGIT /
                    "%21" / "%23" / "%24" / "%25" / "%26" / "%27" / "%2A" /
                    "%2B" / "-" / "%2F" / "%3D" / "%3F" / "%5E" / "_" /
                    "%60" / "%7B" / "%7C" / "%7D" / "~"
uri-no-fold-quote = "%22" *uri-qtext "%22"
uri-qtext         = "%01" / "%02" / "%03" / "%04" / "%05" / "%06" / "%07" /
                    "%08" / "%09" / "%0B" / "%0C" / "%0E" / "%0F" / "%10" /
                    "%11" / "%12" / "%13" / "%14" / "%15" / "%16" / "%17" /
                    "%18" / "%19" / "%1A" / "%1B" / "%1C" / "%1D" / "%1E" /
                    "%1F" / "%20" / "%21" / "%23" / "%24" / "%25" / "%26" /
                    "%27" / "%28" / "%29" / "%2A" / "%2B" / "%2C" / "-" /
                    "." / "%2F" / DIGIT / "%3A" / "%3B" / "%3C" / "%3D" /
                    "%3E" / "%3F" / "%40" / ALPHA / "%5B" / "%5D" / "%5E" /
                    "_" / "%60" / "%7B" / "%7C" / "%7D" / "~" / "%7F" /
                    "%5C%22" / "%5C%5C"

In the usual cases (in which the only punctuation comprises the hyphen-minus, period, and underscore) the choice between the syntaxes has no effect. It is when an unusual e-mail address comes into play that the choice determines legibility:

E-mail address:       "No+spam!_(Get_it?)"@d.example
“tag” <emailAddress>: 'No+spam!_(Get_it%3F)'@d.example
“mailto” URI:         mailto:%22No%2Bspam%21_%28Get_it%3F%29%22@d.example

--
Etan Wexler.
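
As an illustration only, the quoted-string transformation described in the message (steps 1 through 4) could be sketched in Python roughly as follows. The names tag_no_fold_quote, LITERAL, and ENCODED are chosen for this sketch and come from no specification. The sketch assumes the input is already a syntactically valid RFC-2822 <quoted-string>; it checks only the delimiters and a dangling backslash, and, like the written steps, it takes no action for a character that falls in neither set.

```python
import string

# Step 3.3: characters appended to the output unchanged.
LITERAL = set(string.ascii_letters + string.digits + "!$&()*+-.;=_~")

# Step 3.4: characters appended as a percent sign plus two hex digits.
# U+0001..U+0020 except LF and CR, selected punctuation, and U+007F.
ENCODED = (
    {chr(c) for c in range(0x01, 0x21) if c not in (0x0A, 0x0D)}
    | set("\"#%',/:<>?@[\\]^`{|}")
    | {"\x7f"}
)


def tag_no_fold_quote(quoted_string: str) -> str:
    """Turn an RFC-2822 quoted-string into the proposed apostrophe form."""
    if (len(quoted_string) < 2
            or quoted_string[0] != '"'
            or quoted_string[-1] != '"'):
        raise ValueError("expected a string delimited by DQUOTE characters")

    out = ["'"]                          # step 1
    chars = iter(quoted_string[1:-1])    # step 2: drop the delimiters
    for ch in chars:                     # step 3
        if ch in "\r\n":                 # step 3.1: folding CR/LF is skipped
            continue
        if ch == "\\":                   # step 3.2: quoted-pair, take next char
            ch = next(chars, None)
            if ch is None:
                # Assumption of this sketch: reject malformed input.
                raise ValueError("backslash at end of quoted-string")
        if ch in LITERAL:                # step 3.3
            out.append(ch)
        elif ch in ENCODED:              # step 3.4
            out.append("%{:02X}".format(ord(ch)))
        # Otherwise: the written steps specify no action.
    out.append("'")                      # step 4
    return "".join(out)


if __name__ == "__main__":
    local = tag_no_fold_quote('"No+spam!_(Get_it?)"')
    print(local + "@d.example")  # prints 'No+spam!_(Get_it%3F)'@d.example
```

Because U+003F is in the percent-encoded set of step 3.4, the question mark comes out as %3F, while the parentheses, plus sign, and exclamation mark stay literal. A backslash never reaches the output as an escape: only the character it quotes is emitted, which is what the message means by dealing with the content rather than the representation.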
Received on Saturday, 9 July 2005 19:01:02 UTC