diffs from draft-fielding-url-syntax-02.txt

Larry Masinter (masinter@parc.xerox.com)
Sun, 29 Dec 1996 01:37:22 PST


To: uri@bunyip.com
Subject: diffs from draft-fielding-url-syntax-02.txt
From: Larry Masinter <masinter@parc.xerox.com>
Message-Id: <96Dec29.023722pdt."248"@palimpsest.parc.xerox.com>
Date: Sun, 29 Dec 1996 01:37:22 PST

diff draft-fielding-url-syntax-02.txt draft-ietf-url-syntax-00.txt
================================================================
1,2d0
< 
< 
5c3
< <draft-fielding-url-syntax-02>                               R. Fielding
---
> <draft-ietf-url-syntax-00>                                   R. Fielding
9,10c7
< 
<                                                         07 December 1996
---
>                                                         29 December 1996
15d11
< 
38,41c34,43
<       2. Section 6 (New URL Schemes) needs input from the Applications
<          Area A.D.'s.
< 
< 
---
>       2. Need a specific reference to the documents
>          defining Content-Base and Content-Language.
>       3. Examples should include one with multiple parameters and
>          one with multiple queries.
>       4. Suggestion to include a 'normalization' algorithm. Should we?
>       5. Are there semantics for empty fragment identifiers?
>       6. Clarify issue with http://4kids/blah, where a non-FQDN is used.
>       7. Add [MHTML] reference
>       8. URN/URI/URL issue
>    
48,49c50,51
<    for their use and for the definition of new URL schemes.  It revises
<    and replaces the generic definitions in RFC 1738 and RFC 1808.
---
>    for their use. It revises and replaces the generic definitions in
>    RFC 1738 and RFC 1808.
51d52
< 
61c62
<    Recommendations for Internet Resource Locators", RFC 1736 [8].
---
>    Recommendations for Internet Resource Locators", RFC 1736 [9].
64c65
<    [2] and RFC 1808 "Relative Uniform Resource Locators" [7] in order to
---
>    [2] and RFC 1808 "Relative Uniform Resource Locators" [6] in order to
67c68,70
<    URL schemes; those portions will be updated as separate documents.
---
>    URL schemes; those portions will be updated as separate documents,
>    as will the process for registration of new URL schemes.
> 
115c118
<    fashion (see RFC 1737, [10]).  URNs are defined by a separate set of
---
>    fashion (see RFC 1737, [11]).  URNs are defined by a separate set of
128c131
<    ftp://ds.internic.net/rfc/rfc1808.txt
---
>    ftp://ftp.is.co.za/rfc/rfc1808.txt
134c137
<    http://www.ics.uci.edu/pub/ietf/uri/
---
>    http://www.math.uio.no/faq/compression-faq/part1.html
137c140
<    mailto:masinter@parc.xerox.com
---
>    mailto:mduerst@ifi.unizh.ch
146,147c149
<    Many other URL schemes have been defined.  Section 6 describes how
<    new schemes are defined and registered.
---
>    Many other URL schemes have been defined.
161,162c163,164
<    The URL syntax has been designed to promote transcribability over all
<    other concerns.  A URL is a sequence of characters, i.e., letters,
---
>    The URL syntax has been designed to promote transcribability as one
>    of its main concerns. A URL is a sequence of characters, i.e., letters,
185,186c187,188
<          keyboards (and related input devices) across nationalities and
<          languages.
---
>          keyboards (and related input devices) across languages and
>          locales.
195c197,198
<    In such cases, the ability to access a resource is considered more
---
>    The ability to transcribe the resource
>    location from one medium to another was considered more
198a202,205
>    In a few cases, exceptions were made for characters already in
>    widespread use within URLs: the "~", "$" and "#" characters might
>    have otherwise been excluded from URLs.
> 
214c221
<    formal URL syntax.  The grammar is that of RFC 822 [6], except that
---
>    formal URL syntax.  The grammar is that of RFC 822 [5], except that
234c241
<       alpha    = lowalpha | hialpha
---
>       alpha    = lowalpha | upalpha
240c247
<       hialpha  = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
---
>       upalpha  = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
248a256
> 
254,260c262,278
<    All URLs consist of a restricted set of characters, chosen to
<    maximize their transcribability and usability across varying computer
<    systems, natural languages, and nationalities.  This restricted set
<    corresponds to a subset of the graphic printable characters of the
<    US-ASCII coded character set [11].
< 
<    The set of characters allowed for use within URLs can be described in
---
>    All URLs consist of a restricted set of characters, primarily chosen
>    to aid transcribability and usability both in computer
>    systems and in non-computer communications. In addition, characters
>    used conventionally as delimiters around URLs were excluded.  The
>    restricted set of characters consists of digits, letters, and a few
>    graphic symbols corresponding to a subset of the graphic printable
>    characters of the US-ASCII coded character set [12]; they are
>    common to most of the character encodings and input facilities
>    available to Internet users.
> 
>    Within a URL, characters are either used as delimiters, or to
>    represent strings of data (octets) within delimited portions.  When
>    used to represent data directly, the character denotes the octet
>    corresponding to the US-ASCII code for that character.  In
>    addition, an octet may be represented by an escaped encoding.
>    
>    Thus, the set of "characters" allowed within URLs can be described in
263c281
<       urlchar     = reserved | unreserved | escaped
---
>       urlc        = reserved | unreserved | escaped
264a283,308
> 1.5. Characters, octets, and encodings
> 
>    URLs are sequences of characters. Parts of those sequences of
>    characters are then used to represent sequences of octets. In turn,
>    sequences of octets are (frequently) used (with a character
>    encoding scheme) to represent characters. This means that when
>    dealing with URLs it's necessary to work at three levels:
> 
>                      represented characters
>                                 ^
>                                 |
>                                 v
>                               octets
>                                 ^
>                                 |
>                                 v
>                          URL characters
> 
>    This looks more complicated than necessary if all one is dealing
>    with is file names in ASCII, but is necessary when dealing with the
>    wide variety of systems in use. URL characters may represent octets
>    directly or with escape sequences (Section 2.3). Octets may
>    sometimes represent characters in ASCII, in other character
>    encodings, or sometimes be used to represent data that does not
>    correspond to characters at all.
> 
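The three levels described in the Section 1.5 text added by this hunk can be illustrated with a short Python sketch. It is an illustration only and not part of either draft: the helper names `to_url_chars` and `from_url_chars` are hypothetical, and Python's `urllib.parse.quote` applies its own notion of which characters need no escaping, which may differ from the draft's "unreserved" class.

```python
from urllib.parse import quote, unquote_to_bytes


def to_url_chars(text: str, encoding: str) -> str:
    """Represented characters -> octets -> URL characters."""
    octets = text.encode(encoding)   # character encoding scheme applied here
    return quote(octets, safe="")    # octets shown directly or as %XX escapes


def from_url_chars(url_chars: str, encoding: str) -> str:
    """URL characters -> octets -> represented characters."""
    octets = unquote_to_bytes(url_chars)
    # Decoding only works if the reader knows which encoding was used.
    return octets.decode(encoding)


if __name__ == "__main__":
    # The same represented character yields different URL characters
    # depending on the character encoding chosen for the octet level.
    print(to_url_chars("é", "latin-1"))
    print(to_url_chars("é", "utf-8"))
```

The URL characters alone do not say which character encoding produced the octets, which is why the middle level has to be considered separately.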
270,271c314,315
<    purpose.  If the data characters for a URL component would conflict
<    with the reserved purpose, then the conflicting characters must be
---
>    purpose.  If the data for a URL component would conflict
>    with the reserved purpose, then the conflicting data must be
276c320
<    This specification uses the "reserved" set to refer to those
---
>    The "reserved" syntax class above refers to those
281,284c325,329
<    Characters in the "reserved" set are not always reserved.  The set of
<    characters actually reserved within any given URL component is
<    defined by that component.  In general, a character is reserved if
<    escaping that character would change the semantics of the URL.
---
>    Characters in the "reserved" set are not reserved in all contexts.
>    The set of characters actually reserved within any given URL
>    component is defined by that component. In general, a character is
>    reserved if the semantics of the URL changes if the character is
>    replaced with its escaped ASCII encoding.
290,291c335,336
<    letters, decimal digits, and a subset of the punctuation marks and
<    symbols found in US-ASCII.
---
>    letters, decimal digits, and a limited set of punctuation marks and
>    symbols.
293c338
<       unreserved  = alpha | digit | mark
---
>       unreserved  = alphanum | mark
302c347
< 2.3. Escaped Characters
---
> 2.3. Escape Sequences 
304,310c349,353
<    A character must be escaped if it is non-printable, if it is often
<    used to delimit a URL from its context, if it is not found in
<    the US-ASCII coded character set, if it is known to cause problems
<    when passed through some e-mail gateways, or if it is being used as
<    normal data within a component in which it is reserved.  Other
<    characters should not be escaped unless the context of their use
<    requires it.
---
>    Data must be escaped if it does not have a representation using an
>    unreserved character; this includes data that does not correspond
>    to a printable character of the US-ASCII coded character set, and
>    also data that corresponds to characters used to delimit a URL from
>    its context.
314,318c357,360
<    An escaped character is encoded as a character triplet, consisting of
<    the percent character "%" followed by the two hexadecimal digits
<    representing the character's octet code in an 8-bit coded character
<    set.  For example, "%20" is the escaped encoding for the space
<    character.
---
>    An escaped octet is encoded as a character triplet, consisting
>    of the percent character "%" followed by the two hexadecimal digits
>    representing the octet code. For example, "%20" is the escaped
>    encoding for the US-ASCII space character.
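The revised wording defines an escape in terms of a single octet rather than a character. A minimal Python sketch of that triplet rule follows; the function names `escape_octet` and `unescape_triplet` are hypothetical, and emitting upper-case hexadecimal digits is an assumption made here, not something the hunk states.

```python
import string


def escape_octet(octet: int) -> str:
    """Encode one octet as "%" followed by two hexadecimal digits."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("an octet must be in the range 0..255")
    return "%" + format(octet, "02X")


def unescape_triplet(triplet: str) -> int:
    """Decode a "%XX" triplet back to the octet it represents."""
    if (
        len(triplet) != 3
        or triplet[0] != "%"
        or any(c not in string.hexdigits for c in triplet[1:])
    ):
        raise ValueError("not an escape triplet: %r" % triplet)
    return int(triplet[1:], 16)


if __name__ == "__main__":
    print(escape_octet(0x20))        # the US-ASCII space character
    print(unescape_triplet("%7e"))   # the octet for "~"
```

Decoding accepts hexadecimal digits in either case, so "%7e" and "%7E" denote the same octet.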
324,338d365
<    The 8-bit coded character set of the octet must be a superset of the
<    US-ASCII coded character set, such that the US-ASCII characters have
<    the same escaped encoding regardless of the larger octet character
<    set.  The coded character set chosen must correspond to the character
<    set of the mechanism that will interpret the URL component in which
<    the escaped character is used.  A sequence of escape triplets are
<    used if the character is coded as a sequence of octets.
< 
<    Any character, from any character set, can be included in a URL via
<    the escaped encoding, provided that the mechanism which will
<    interpret the URL has an octet encoding for that character.  However,
<    only that mechanism (the originator of the URL) can determine which
<    character is represented by the octet.  A client without knowledge of
<    the origination mechanism cannot unescape the character for display.
<    
342,343c369,370
<    completed URL might change its semantics.  The only time that
<    characters within a URL can be safely escaped is when the URL is
---
>    completed URL might change its semantics.  Normally, the only time
>    escape encodings can safely be made is when the URL is
348c375
<    semantics.  Likewise, a URL must be separated into its components
---
>    semantics. Likewise, a URL must be separated into its components
350c377,384
<    safely unescaped.
---
>    safely decoded.
> 
>    In some cases, data that could be represented by an unreserved
>    character may appear escaped; for example, some of the unreserved
>    mark characters are automatically escaped by some systems. It
>    is safe to unescape these within the body of a URL. 
>    For example, "%7e" is sometimes used instead of "~" in an http URL
>    path, but the two can be used interchangeably.
360,368d393
<    An exception to the unescaping rules is allowed when it is known that
<    some older systems are escaping a character that does not need to be
<    escaped, and when it is possible to reliably discriminate between
<    such an escaped data character and any reserved use for that
<    character.  For example, it is generally safe to unescape "%7e" when
<    it occurs near the beginning of an http URL path, since many older
<    systems automatically escape the "~" character even though it is
<    unreserved.
< 
372,373c397,398
<    description of those characters which have been excluded and the
<    reasons for their exclusion.
---
>    description of those US-ASCII characters which have been excluded
>    and the reasons for their exclusion.
396c421
<    references.  The percent character "%" is excluded because it is used
---
>    references (Section 3). The percent character "%" is excluded because it is used
402c427,428
<    agents are known to sometimes modify such characters.
---
>    agents are known to sometimes modify such characters, or they are
>    used as delimiters.
413,417c439,444
<    Excluded characters must be escaped in order to be properly
<    represented within a URL.  However, there do exist some systems that
<    allow characters from the "unwise" and "national" sets to be used in
<    URL references; a robust implementation should be prepared to handle
<    those characters when it is possible to do so.
---
>    Data corresponding to excluded characters must be escaped in order
>    to be properly represented within a URL.  However, there do exist
>    some systems that allow characters from the "unwise" and "national"
>    sets to be used in URL references (Section 3); a robust
>    implementation should be prepared to handle those characters when
>    it is possible to do so.
425c452
<    be attached to additional information in the form of a fragment
---
>    have additional information attached in the form of a fragment
449c476
<    media type of the retrieved resource.
---
>    media type of the resource referenced by the URL.
451c478
<       fragment      = *urlchar
---
>       fragment      = *urlc
501c528
<       opaque-URL    = scheme ":" *urlchar
---
>       opaque-URL    = scheme ":" *urlc
506,507c533,534
<    separating hierarchical components.  For some file systems, the "/"
<    used to denote the hierarchical structure of a URL corresponds to the
---
>    separating hierarchical components.  For some file systems, a "/"
>    character (used to denote the hierarchical structure of a URL) is the
569c596
<    [9] and Section 2.1 of RFC 1123 [5]: a sequence of domain labels
---
>    [10] and Section 2.1 of RFC 1123 [4]: a sequence of domain labels
611c638
<       query         = *urlchar
---
>       query         = *urlc
745c772
<    Messages are considered to be composite documents.  The base URL of a
---
>    MIME messages [7] are considered to be composite documents.  The base URL of a
748c775
<    of message headers like those described in MIME [4], the base URL
---
>    of message headers like those described in MIME [7], the base URL
789c816
<    media types defined by MIME (RFC 1521, [4]), define a hierarchy of
---
>    media types defined by MIME [8], define a hierarchy of
940,970c967
< 6. Adding New Schemes
< 
<    The Internet Assigned Numbers Authority (IANA) maintains a registry
<    of URL schemes.
< 
<    The current process for defining URL schemes is via the Internet
<    standards process: new URL schemes should be described in
<    standards-track RFCs.  Over time, other methods of registering URL
<    schemes may be added.
< 
<    URL schemes must have demonstrable utility and operability.  One way
<    to provide such a demonstration is via a gateway which provides
<    objects in the new scheme for clients using an existing protocol.  If
<    the new scheme does not locate resources that are data objects, the
<    properties of names in the new space must be clearly defined.
< 
<    URL schemes should follow the same syntactic conventions of existing
<    schemes when appropriate.  URL schemes should use the generic-URL
<    syntax if they are intended to be used with relative URLs.  A
<    description of the allowed relative forms should be included in the
<    scheme's definition.
< 
<    URL schemes cannot redefine the algorithm for resolving relative
<    references.  The resolution algorithm must remain independent of the
<    scheme name in order to preserve the mobility of relative references
<    between naming schemes and the ability to parse and resolve a
<    relative reference without knowing the properties of any particular
<    scheme.
< 
< 
< 7. Security Considerations
---
> 6. Security Considerations
990,991c987,990
<    operation.  An example has been the use of gopher URLs to cause a rude
<    message to be sent via a SMTP server.  Caution should be used when
---
>    operation.  An example has been the use of gopher URLs to cause an
>    unintended or impersonating message to be sent via an SMTP server.
>    
>    Caution should be used when
1007,1008c1006
< 
< 8. Acknowledgements
---
> 7. Acknowledgements
1010c1008
<    This document was derived from RFC 1738 [2] and RFC 1808 [7]; the
---
>    This document was derived from RFC 1738 [2] and RFC 1808 [6]; the
1012,1015c1010,1013
<    this draft has benefited from comments by Lauren Wood.
< 
< 
< 9. References
---
>    contributions by Lauren Wood, Martin Duerst, Gisle Aas, Martijn
>    Koster, Ryan Moats and Foteos Macrides are gratefully acknowledged.
>    
> 8. References
1029,1034c1027
<    [4] Borenstein, N., and N. Freed, "MIME (Multipurpose Internet Mail
<        Extensions): Mechanisms for Specifying and Describing the Format
<        of Internet Message Bodies", RFC 1521, Bellcore, Innosoft,
<        September 1993.
< 
<    [5] Braden, R., Editor, "Requirements for Internet Hosts --
---
>    [4] Braden, R., Editor, "Requirements for Internet Hosts --
1037c1030
<    [6] Crocker, D., "Standard for the Format of ARPA Internet Text
---
>    [5] Crocker, D., "Standard for the Format of ARPA Internet Text
1040c1033
<    [7] Fielding, R., "Relative Uniform Resource Locators", RFC 1808,
---
>    [6] Fielding, R., "Relative Uniform Resource Locators", RFC 1808,
1043c1036,1044
<    [8] Kunze, J., "Functional Recommendations for Internet Resource
---
>    [7] N. Freed & N. Borenstein, "Multipurpose Internet Mail
>        Extensions (MIME) Part One: Format of Internet Message Bodies,"
>        RFC 2045, November 1996.
> 
>    [8] Freed, N., and N. Borenstein, "Multipurpose Internet Mail
>        Extensions (MIME): Part Two: Media Types", RFC 2046, Innosoft, Bellcore,
>        November 1996.
> 
>    [9] Kunze, J., "Functional Recommendations for Internet Resource
1046c1047
<    [9] Mockapetris, P., "Domain Names - Concepts and Facilities",
---
>   [10] Mockapetris, P., "Domain Names - Concepts and Facilities",
1050c1051
<   [10] Sollins, K., and L. Masinter, "Functional Requirements for
---
>   [11] Sollins, K., and L. Masinter, "Functional Requirements for
1054c1055
<   [11] US-ASCII. "Coded Character Set -- 7-bit American Standard Code
---
>   [12] US-ASCII. "Coded Character Set -- 7-bit American Standard Code
1058c1059
< 10. Authors' Addresses
---
> 9. Authors' Addresses
1094c1095
<       opaque-URL    = scheme ":" *urlchar
---
>       opaque-URL    = scheme ":" *urlc
1121c1122
<       query         = *urlchar
---
>       query         = *urlc
1123c1124
<       fragment      = *urlchar
---
>       fragment      = *urlc
1125c1126
<       urlchar       = reserved | unreserved | escaped
---
>       urlc          = reserved | unreserved | escaped
1136c1137
<       alpha         = lowalpha | hialpha
---
>       alpha         = lowalpha | upalpha
1141c1142
<       hialpha  = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
---
>       upalpha  = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
1160,1161c1161,1162
<       ^(([^/?#]+):)?(//([^/?#]*))?([^?#]*)?(\?([^#]*))?(#(.*))?
<        12           3  4          5        6  7        8 9
---
>       ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
>        12            3  4          5       6  7        8 9
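The revised expression in this hunk can be tried directly with Python's `re` module. The sketch below copies the new pattern verbatim and maps groups 2, 4, 5, 7 and 9 to components. The field names (`scheme`, `authority`, `path`, `query`, `fragment`) and the helper `split_url` are labels chosen here for illustration and are not taken from the draft.

```python
import re
from typing import NamedTuple, Optional

# Pattern copied from the draft-ietf-url-syntax-00 side of the hunk above.
URL_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


class Parts(NamedTuple):
    scheme: Optional[str]     # group 2
    authority: Optional[str]  # group 4
    path: str                 # group 5
    query: Optional[str]      # group 7
    fragment: Optional[str]   # group 9


def split_url(reference: str) -> Parts:
    """Split a URL reference into components without validating them."""
    match = URL_REFERENCE.match(reference)
    # Every part of the pattern is optional, so a match is always found.
    assert match is not None
    return Parts(
        match.group(2),
        match.group(4),
        match.group(5),
        match.group(7),
        match.group(9),
    )


if __name__ == "__main__":
    print(split_url("http://www.math.uio.no/faq/compression-faq/part1.html"))
    print(split_url("mailto:mduerst@ifi.unizh.ch"))
    print(split_url("../a?x=1#frag"))
```

A component that is absent comes back as `None`, which keeps it distinguishable from a component that is present but empty.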
1328,1336c1329,1337
<    The prefix "URL:", with or without a trailing space, is sometimes
<    used to help distinguish a URL from normal text.  These wrappers do
<    not form part of the URL.  In the case where a fragment identifier is
<    associated with a URL reference, the fragment would be placed within
<    the brackets as well (separated from the URL with a "#" character).
< 
<    In some cases, extra whitespace (spaces, linebreaks, tabs, etc.) may
<    need to be added to break long URLs across lines.  The whitespace
<    should be ignored when extracting the URL.
---
>    These wrappers do not form part of the URL.
> 
>    In the case where a fragment identifier is associated with a URL
>    reference, the fragment would be placed within the brackets as well
>    (separated from the URL with a "#" character).
> 
>    In some cases, extra whitespace (spaces, linebreaks, tabs, etc.)
>    may need to be added to break long URLs across lines. The
>    whitespace should be ignored when extracting the URL. 
1344a1346,1356
>    Using <> angle brackets around each URL is especially recommended
>    as a delimiting style for URLs that contain whitespace.
> 
>    The prefix "URL:" (with or without a trailing space) was
>    recommended as a way to help distinguish a URL from other
>    bracketed designators, although this is not common in practice.
>    
>    For robustness, software that accepts user-typed URLs should
>    attempt to recognize and strip both delimiters and embedded
>    whitespace.
> 
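The robustness advice added in this hunk (strip delimiters and embedded whitespace from user-typed URLs) could be sketched in Python as follows. The function name `extract_url` is hypothetical, and the sketch assumes only the angle-bracket wrapper and the exact-case "URL:" prefix mentioned in the text; other wrappers are not handled.

```python
import re


def extract_url(text: str) -> str:
    """Strip a <...> wrapper, a "URL:" prefix, and embedded whitespace."""
    s = text.strip()
    # Remove one pair of surrounding angle brackets, if present.
    if len(s) >= 2 and s[0] == "<" and s[-1] == ">":
        s = s[1:-1].strip()
    # Remove the optional "URL:" prefix (a trailing space is handled below).
    if s.startswith("URL:"):
        s = s[4:]
    # Drop whitespace that was added to break a long URL across lines.
    return re.sub(r"\s+", "", s)


if __name__ == "__main__":
    typed = "<URL:http://www.math.uio.no/faq/\n   compression-faq/part1.html>"
    print(extract_url(typed))
```

Removing all whitespace is safe only on the assumption that data whitespace inside the URL has already been escaped, for example as "%20".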
1453c1465
<    HTTP/1.1 and MHTML.
---
>    HTTP/1.1 and MHTML [palme].