- From: Martin J. Duerst <mduerst@ifi.unizh.ch>
- Date: Mon, 12 May 1997 14:34:43 +0200 (MET DST)
- To: Larry Masinter <masinter@parc.xerox.com>
- cc: URI mailing list <uri@bunyip.com>
On Fri, 2 May 1997, Larry Masinter wrote:
> Network Working Group T. Berners-Lee, MIT/LCS
> INTERNET-DRAFT R. Fielding, U.C. Irvine
> draft-fielding-url-syntax-05 L. Masinter, Xerox Corporation
> Expires six months after publication date May 2, 1997
>
> Uniform Resource Locators (URL): Generic Syntax and Semantics
I have finally had time to have a look at this draft. In this mail,
I'll just point out a few corrections that shouldn't cause much
discussion.
> 2.3. Unreserved Characters
>
> Data characters which are allowed in a URL but do not have a reserved
> purpose are called unreserved. These include upper and lower case
> letters, decimal digits, and a limited set of punctuation marks and
> symbols.
The term "data characters" is never defined or explained. I would
suggest using "URL characters" here, or just "characters".
> 2.4. Escape Sequences
>
> Data must be escaped if it does not have a representation using an
> unreserved character; this includes data that does not correspond
> to a printable character of the US-ASCII coded character set, or
> that corresponds to any US-ASCII character that is disallowed, as
> explained below.
Here, I would suggest replacing "data" with "octets" (in both places,
with the appropriate grammatical changes).
> 2.4.1. Escaped Encoding
>
> An escaped octet is encoded as a character triplet, consisting
> of the percent character "%" followed by the two hexadecimal digits
> representing the octet code.
I think it is better to start this sentence as follows:
"An octet is escaped by encoding it as a character triplet,...".
At one time there was a category "escaped character" or
"escaped octet", but it was confusing and has rightly been
removed.
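To make the triplet form concrete, here is a minimal Python sketch of
the encoding described in 2.4.1. The function names are hypothetical,
and the use of upper-case hexadecimal digits on output is an assumption
of this sketch, not something the quoted text requires.

```python
import string

HEX_DIGITS = set(string.hexdigits)  # 0-9, a-f, A-F


def escape_octet(octet: int) -> str:
    """Encode one octet (0..255) as "%" followed by two hexadecimal digits."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("an octet must be in the range 0..255")
    # Upper-case hex digits are an arbitrary choice made for this sketch.
    return "%{:02X}".format(octet)


def unescape_triplet(triplet: str) -> int:
    """Decode a "%XX" triplet back to the octet it represents."""
    if (len(triplet) != 3 or triplet[0] != "%"
            or not set(triplet[1:]) <= HEX_DIGITS):
        raise ValueError("not an escape triplet: {!r}".format(triplet))
    return int(triplet[1:], 16)


print(escape_octet(0x7E))       # the octet for "~"
print(unescape_triplet("%7e"))  # lower-case digits are accepted on input
```

Note that it is the octet that gets escaped; the result is just three
ordinary characters, which is why no separate "escaped octet" category
is needed.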
> 2.4.2. When to Escape and Unescape
>
> A URL is always in an "escaped" form, since escaping or unescaping
> a completed URL might change its semantics. Normally, the only
> time escape encodings can safely be made is when the URL is being
> created from its component parts; each component may have its own
> set of characters which are reserved, so only the mechanism
> responsible for generating or interpreting that component can
> determine whether or not escaping a character will change its
> semantics. Likewise, a URL must be separated into its components
> before the escaped characters within those components can be safely
> decoded.
>
> In some cases, data that could be represented by an unreserved
> character may appear escaped; for example, some of the unreserved
> "mark" characters are automatically escaped by some systems. It is
> safe to unescape these within the body of a URL. For example,
> "%7e" is sometimes used instead of "~" in http URL path, but the
> two can be used interchangably.
>
> Because the percent "%" character always has the reserved purpose of
> being the escape indicator, it must be escaped as "%25" in order to
> be used as data within a URL. Implementers should be careful not to
> escape or unescape the same string more than once, since unescaping
> an already unescaped string might lead to misinterpreting a percent
> data character as another escaped character, or vice versa in the
> case of escaping an already escaped string.
>
> 2.4.3. Excluded US-ASCII Characters
>
> Although they are disallowed within the URL syntax, we include here
> a description of those US-ASCII characters which have been excluded
> and the reasons for their exclusion.
>
> The control characters in the US-ASCII coded character set are not
> use within a URL, both because they are non-printable and because
> they are likely to be misinterpreted by some control mechanisms.
>
> control = <US-ASCII coded characters 00-1F and 7F hexadecimal>
>
> The space character is excluded because significant spaces may
> disappear and insignificant spaces may be introduced when URLs are
> transcribed or typeset or subjected to the treatment of
> word-processing programs. Whitespace is also used to delimit URLs
> in many contexts.
>
> space = <US-ASCII coded character 20 hexadecimal>
>
> The angle-bracket "<" and ">" and double-quote (") characters are
> excluded because they are often used as the delimiters around URLs
> in text documents and protocol fields. The character "#" is
> excluded because it is used to delimit a URL from a fragment
> identifier in URL references (Section 3). The percent character "%"
> is excluded because it is used for the encoding of escaped
> characters.
>
> delims = "<" | ">" | "#" | "%" | <">
>
> Other characters are excluded because gateways and other transport
> agents are known to sometimes modify such characters, or they are
> used as delimiters.
>
> unwise = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"
>
> Data corresponding to excluded characters must be escaped in order
> to be properly represented within a URL.
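As an aside on the warning in 2.4.2 about escaping or unescaping the
same string more than once, the hazard can be sketched in Python as
follows. This uses the standard library's urllib.parse.quote and
unquote as stand-ins; they follow later specifications than this draft,
so treat the exact set of characters they escape as an assumption. The
variable names are illustrative only.

```python
from urllib.parse import quote, unquote

data = "%7e"             # literal data: a percent sign followed by "7e"

escaped = quote(data)    # the "%" is escaped as "%25"
once = unquote(escaped)  # unescaping once recovers the original data
twice = unquote(once)    # unescaping again misreads the data as an escape

print(escaped)
print(once)
print(twice)

# Escaping an already escaped string is wrong in the other direction:
# the "%" of the existing escape is itself escaped again.
print(quote(escaped))
```

The second unescape turns the three data characters into a single "~",
which is exactly the misinterpretation the draft warns about.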
> 4. Generic URL Syntax
This title may be somewhat confusing, as it has to be read as
generic URL-syntax, as opposed to generic-URL syntax. Although
the second sense is distinguished in the draft by a hyphen,
things could be improved by changing the title of Section 4
to "General URL Syntax" or just "URL Syntax", or maybe
something else.
> 4.2. Opaque and Hierarchical URLs
Here we have a similar problem. It's actually opaque and generic
URLs (the latter in the sense of generic-URL). Generic-URLs may,
but need not, be hierarchical.
> The URL syntax does not require that the scheme-specific-part have
> any general structure or set of semantics which is common among all
> URLs. However, a subset of URLs do share a common syntax for
> representing hierarchical relationships within the locator namespace.
> This generic-URL syntax is used in interpreting relative URLs.
>
> absoluteURL = generic-URL | opaque-URL
>
> opaque-URL = scheme ":" *urlc
>
> generic-URL = scheme ":" relativeURL
>
> URLs which are hierarchical in nature use the slash "/" character for
> separating hierarchical components. For some file systems, a "/"
> character (used to denote the hierarchical structure of a URL) is the
> delimiter used to construct a file name hierarchy, and thus the URL
> path will look similar to a file pathname. This does NOT imply that
> the URL is a Unix pathname.
The text in this paragraph should probably say that hierarchical
URLs are a subset of generic-URLs.
> 4.3. URL Syntactic Components
>
> The URL syntax is dependent upon the scheme. Some schemes use
> reserved characters like "?" and ";" to indicate special components,
> while others just consider them to be part of the path. However,
> most URL schemes use a common sequence of four main components to
> define the location of a resource
[this is a preparatory note for a later comment:]
We have four components for a *URL* here.
> To actually be "Uniform" as a resource locator,
> a URL hostname should be a fully qualified domain names. In practice,
> however, the host component may be a local domain literal.
Remove the "s" in "names".
> 4.4. Parsing a URL Reference
>
> A URL reference is typically parsed according to the four main
> components in order to determine what components are present and
> whether or not the reference is relative or absolute.
URLs have four main components. URL references therefore have five
main components (the four above plus the fragment identifier).
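The count can be illustrated with Python's urllib.parse, which splits
a URL reference into five parts. This is only a sketch: the component
names below are those of the Python library, not the draft's, and the
example URL is made up.

```python
from urllib.parse import urldefrag, urlsplit

reference = "http://www.example.com/a/b?x=1#sec2"  # hypothetical URL reference

# urlsplit yields five parts: the four URL components plus the fragment.
parts = urlsplit(reference)
print(len(parts))
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# urldefrag separates the URL proper from the fragment identifier.
url, fragment = urldefrag(reference)
print(url)
print(fragment)
```

Stripping the fragment first leaves a plain URL, which then has the
four components described in 4.3.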
Regards, Martin.
Received on Monday, 12 May 1997 08:37:03 UTC