Re: draft-fielding-url-syntax-05.txt from Martin J. Duerst on 1997-05-12 (uri@w3.org from May 1997)

From: Martin J. Duerst <mduerst@ifi.unizh.ch>
Date: Mon, 12 May 1997 14:34:43 +0200 (MET DST)
To: Larry Masinter <masinter@parc.xerox.com>
cc: URI mailing list <uri@bunyip.com>
Message-ID: <Pine.SUN.3.96.970512140820.245R-100000@enoshima>
On Fri, 2 May 1997, Larry Masinter wrote:

> Network Working Group                            T. Berners-Lee, MIT/LCS
> INTERNET-DRAFT                                 R. Fielding,  U.C. Irvine
> draft-fielding-url-syntax-05              L. Masinter, Xerox Corporation
> Expires six months after publication date                    May 2, 1997
> 
>     Uniform Resource Locators (URL): Generic Syntax and Semantics

I have finally had time to have a look at this draft. In this mail,
I'll just point out a few corrections that shouldn't cause much
discussion.



> 2.3. Unreserved Characters
> 
>    Data characters which are allowed in a URL but do not have a reserved
>    purpose are called unreserved.  These include upper and lower case
>    letters, decimal digits, and a limited set of punctuation marks and
>    symbols.

The term "data characters" is never defined or explained. I would
suggest using "URL characters" here, or just "characters".


> 2.4. Escape Sequences
> 
>    Data must be escaped if it does not have a representation using an
>    unreserved character; this includes data that does not correspond
>    to a printable character of the US-ASCII coded character set, or
>    that corresponds to any US-ASCII character that is disallowed, as
>    explained below.

Here, I would suggest replacing "data" with "octets" (in both places,
with the appropriate grammatical changes).


> 2.4.1. Escaped Encoding
> 
>    An escaped octet is encoded as a character triplet, consisting
>    of the percent character "%" followed by the two hexadecimal digits
>    representing the octet code.

I think it is better to start this sentence as follows:
"An octet is escaped by encoding it as a character triplet,...".

There was, at some time, a category "escaped character" or
"escaped octet", but it was confusing and has been nicely
removed.
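
As an illustration of the suggested wording, here is a minimal sketch
in Python of escaping a single octet as a character triplet. The
function name escape_octet is hypothetical, and the choice of
upper-case hexadecimal digits is an assumption made for the example;
the draft itself also shows lower-case digits ("%7e").

```python
def escape_octet(octet: int) -> str:
    """Encode one octet (0-255) as "%" followed by two hexadecimal digits."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("an octet must be in the range 0-255")
    # Upper-case hex digits are an arbitrary choice for this sketch.
    return "%{:02X}".format(octet)


print(escape_octet(0x7E))  # %7E
print(escape_octet(0x20))  # %20
```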



> 2.4.2. When to Escape and Unescape
> 
>    A URL is always in an "escaped" form, since escaping or unescaping
>    a completed URL might change its semantics.  Normally, the only
>    time escape encodings can safely be made is when the URL is being
>    created from its component parts; each component may have its own
>    set of characters which are reserved, so only the mechanism
>    responsible for generating or interpreting that component can
>    determine whether or not escaping a character will change its
>    semantics. Likewise, a URL must be separated into its components
>    before the escaped characters within those components can be safely
>    decoded.
> 
>    In some cases, data that could be represented by an unreserved
>    character may appear escaped; for example, some of the unreserved
>    "mark" characters are automatically escaped by some systems. It is
>    safe to unescape these within the body of a URL.  For example,
>    "%7e" is sometimes used instead of "~" in http URL path, but the
>    two can be used interchangably.
> 
>    Because the percent "%" character always has the reserved purpose of
>    being the escape indicator, it must be escaped as "%25" in order to
>    be used as data within a URL.  Implementers should be careful not to
>    escape or unescape the same string more than once, since unescaping
>    an already unescaped string might lead to misinterpreting a percent
>    data character as another escaped character, or vice versa in the
>    case of escaping an already escaped string.
> 
> 2.4.3. Excluded US-ASCII Characters
> 
>    Although they are disallowed within the URL syntax, we include here
>    a description of those US-ASCII characters which have been excluded
>    and the reasons for their exclusion.
> 
>    The control characters in the US-ASCII coded character set are not
>    use within a URL, both because they are non-printable and because
>    they are likely to be misinterpreted by some control mechanisms.
> 
>    control     = <US-ASCII coded characters 00-1F and 7F hexadecimal>
> 
>    The space character is excluded because significant spaces may
>    disappear and insignificant spaces may be introduced when URLs are
>    transcribed or typeset or subjected to the treatment of
>    word-processing programs.  Whitespace is also used to delimit URLs
>    in many contexts.
>    
>    space       = <US-ASCII coded character 20 hexadecimal>
> 
>    The angle-bracket "<" and ">" and double-quote (") characters are
>    excluded because they are often used as the delimiters around URLs
>    in text documents and protocol fields.  The character "#" is
>    excluded because it is used to delimit a URL from a fragment
>    identifier in URL references (Section 3). The percent character "%"
>    is excluded because it is used for the encoding of escaped
>    characters.
> 
>    delims      = "<" | ">" | "#" | "%" | <">
>    
>    Other characters are excluded because gateways and other transport
>    agents are known to sometimes modify such characters, or they are
>    used as delimiters.
> 
>    unwise      = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"
> 
>    Data corresponding to excluded characters must be escaped in order
>    to be properly represented within a URL.
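
The warning in 2.4.2 about escaping or unescaping the same string more
than once can be seen in a minimal sketch using Python's standard
urllib.parse module. The sample data "%7e" (meant here as three literal
characters, not as an escaped tilde) is an assumption chosen for
illustration, and passing safe="" is a choice made so that every
reserved character is escaped.

```python
from urllib.parse import quote, unquote

data = "%7e"  # three literal characters: percent sign, "7", "e"

escaped_once = quote(data, safe="")           # "%" becomes "%25"
escaped_twice = quote(escaped_once, safe="")  # the new "%" is escaped again

unescaped_once = unquote(escaped_once)        # recovers the original data
unescaped_twice = unquote(unescaped_once)     # misreads the data as an escape

print(escaped_once)     # %257e
print(escaped_twice)    # %25257e
print(unescaped_once)   # %7e
print(unescaped_twice)  # ~
```

Only the single escape followed by a single unescape round-trips the
data; each extra pass changes what the string means.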



> 4. Generic URL Syntax

This title may be somewhat confusing, as it has to be read as
generic URL-Syntax, as opposed to Generic-URL syntax. While
the two are distinguished by the hyphen in the second case,
the title of Chapter 4 could be improved by changing it
to "General URL Syntax" or just "URL Syntax", or maybe
something else.



> 4.2. Opaque and Hierarchical URLs

Here we have a similar problem. The section is actually about opaque
and generic URLs (the latter in the sense of generic-URL). Generic-URLs
may, but need not, be hierarchical.


>    The URL syntax does not require that the scheme-specific-part have
>    any general structure or set of semantics which is common among all
>    URLs.  However, a subset of URLs do share a common syntax for
>    representing hierarchical relationships within the locator namespace.
>    This generic-URL syntax is used in interpreting relative URLs.
> 
>       absoluteURL   = generic-URL | opaque-URL
> 
>       opaque-URL    = scheme ":" *urlc
> 
>       generic-URL   = scheme ":" relativeURL
> 
>    URLs which are hierarchical in nature use the slash "/" character for
>    separating hierarchical components.  For some file systems, a "/"
>    character (used to denote the hierarchical structure of a URL) is the
>    delimiter used to construct a file name hierarchy, and thus the URL
>    path will look similar to a file pathname.  This does NOT imply that
>    the URL is a Unix pathname.

The text in this paragraph should probably say that hierarchical
URLs are a subset of generic-URLs.



> 4.3. URL Syntactic Components
> 
>    The URL syntax is dependent upon the scheme.  Some schemes use
>    reserved characters like "?" and ";" to indicate special components,
>    while others just consider them to be part of the path.  However,
>    most URL schemes use a common sequence of four main components to
>    define the location of a resource

[This is a preparatory note for a later comment:]
We have four components for a *URL* here.


>                      To actually be "Uniform" as a resource locator,
>    a URL hostname should be a fully qualified domain names. In practice,
>    however, the host component may be a local domain literal.

Remove the "s" in "names".


> 4.4. Parsing a URL Reference
> 
>    A URL reference is typically parsed according to the four main
>    components in order to determine what components are present and
>    whether or not the reference is relative or absolute.

URLs have four main components. URL references therefore have five
main components: the four URL components plus the fragment identifier.
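
The count can be checked with a minimal sketch using Python's
urllib.parse.urlsplit, which happens to split a reference into five
parts. The field names (scheme, netloc, path, query, fragment) are
Python's, not the draft's, and the example address is made up for
illustration.

```python
from urllib.parse import urlsplit

# Hypothetical URL reference: a URL followed by "#" and a fragment.
reference = "http://www.example.com/a/b?x=1#sec2"

parts = urlsplit(reference)

# The first four fields belong to the URL itself.
print(parts.scheme)    # http
print(parts.netloc)    # www.example.com
print(parts.path)      # /a/b
print(parts.query)     # x=1

# The fifth field exists only because this is a URL reference.
print(parts.fragment)  # sec2
print(len(parts))      # 5
```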




Regards,	Martin.
Received on Monday, 12 May 1997 08:37:03 UTC