- From: Martin J. Duerst <mduerst@ifi.unizh.ch>
- Date: Mon, 12 May 1997 14:34:43 +0200 (MET DST)
- To: Larry Masinter <masinter@parc.xerox.com>
- cc: URI mailing list <uri@bunyip.com>
On Fri, 2 May 1997, Larry Masinter wrote:

> Network Working Group                       T. Berners-Lee, MIT/LCS
> INTERNET-DRAFT                              R. Fielding, U.C. Irvine
> draft-fielding-url-syntax-05                L. Masinter, Xerox Corporation
> Expires six months after publication date   May 2, 1997
>
> Uniform Resource Locators (URL): Generic Syntax and Semantics

I have finally had time to have a look at this draft.
In this mail, I'll just point out a few corrections
that shouldn't cause much discussion.

> 2.3. Unreserved Characters
>
> Data characters which are allowed in a URL but do not have a reserved
> purpose are called unreserved. These include upper and lower case
> letters, decimal digits, and a limited set of punctuation marks and
> symbols.

The term "data characters" is never defined or explained.
I would suggest using "URL characters" here, or only "characters".

> 2.4. Escape Sequences
>
> Data must be escaped if it does not have a representation using an
> unreserved character; this includes data that does not correspond
> to a printable character of the US-ASCII coded character set, or
> that corresponds to any US-ASCII character that is disallowed, as
> explained below.

Here, I would suggest replacing "data" with "octets" (twice,
with the appropriate grammatical changes).

> 2.4.1. Escaped Encoding
>
> An escaped octet is encoded as a character triplet, consisting
> of the percent character "%" followed by the two hexadecimal digits
> representing the octet code.

I think it is better to start this sentence as follows:
"An octet is escaped by encoding it as a character triplet,...".
There was, at some time, a category "escaped character" or
"escaped octet", but it was confusing and has been nicely removed.

> 2.4.2. When to Escape and Unescape
>
> A URL is always in an "escaped" form, since escaping or unescaping
> a completed URL might change its semantics.
> Normally, the only
> time escape encodings can safely be made is when the URL is being
> created from its component parts; each component may have its own
> set of characters which are reserved, so only the mechanism
> responsible for generating or interpreting that component can
> determine whether or not escaping a character will change its
> semantics. Likewise, a URL must be separated into its components
> before the escaped characters within those components can be safely
> decoded.
>
> In some cases, data that could be represented by an unreserved
> character may appear escaped; for example, some of the unreserved
> "mark" characters are automatically escaped by some systems. It is
> safe to unescape these within the body of a URL. For example,
> "%7e" is sometimes used instead of "~" in http URL path, but the
> two can be used interchangably.
>
> Because the percent "%" character always has the reserved purpose of
> being the escape indicator, it must be escaped as "%25" in order to
> be used as data within a URL. Implementers should be careful not to
> escape or unescape the same string more than once, since unescaping
> an already unescaped string might lead to misinterpreting a percent
> data character as another escaped character, or vice versa in the
> case of escaping an already escaped string.
>
> 2.4.3. Excluded US-ASCII Characters
>
> Although they are disallowed within the URL syntax, we include here
> a description of those US-ASCII characters which have been excluded
> and the reasons for their exclusion.
>
> The control characters in the US-ASCII coded character set are not
> use within a URL, both because they are non-printable and because
> they are likely to be misinterpreted by some control mechanisms.
>
>    control = <US-ASCII coded characters 00-1F and 7F hexadecimal>
>
> The space character is excluded because significant spaces may
> disappear and insignificant spaces may be introduced when URLs are
> transcribed or typeset or subjected to the treatment of
> word-processing programs. Whitespace is also used to delimit URLs
> in many contexts.
>
>    space = <US-ASCII coded character 20 hexadecimal>
>
> The angle-bracket "<" and ">" and double-quote (") characters are
> excluded because they are often used as the delimiters around URLs
> in text documents and protocol fields. The character "#" is
> excluded because it is used to delimit a URL from a fragment
> identifier in URL references (Section 3). The percent character "%"
> is excluded because it is used for the encoding of escaped
> characters.
>
>    delims = "<" | ">" | "#" | "%" | <">
>
> Other characters are excluded because gateways and other transport
> agents are known to sometimes modify such characters, or they are
> used as delimiters.
>
>    unwise = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"
>
> Data corresponding to excluded characters must be escaped in order
> to be properly represented within a URL.

> 4. Generic URL Syntax

This may be somewhat disturbing, as it has to be read as
generic URL-Syntax, as opposed to Generic-URL syntax.
While this is distinguished by using a hyphen in the second
case, it could be improved by changing the title of Chapter 4
to "General URL Syntax" or just "URL Syntax", or maybe
something else.

> 4.2. Opaque and Hierarchical URLs

Here we have a similar problem. It's actually opaque and
generic URLs (the latter in the sense of generic-URL).
Generic-URLs may, but need not, be hierarchical.

> The URL syntax does not require that the scheme-specific-part have
> any general structure or set of semantics which is common among all
> URLs. However, a subset of URLs do share a common syntax for
> representing hierarchical relationships within the locator namespace.
> This generic-URL syntax is used in interpreting relative URLs.
>
>    absoluteURL = generic-URL | opaque-URL
>
>    opaque-URL = scheme ":" *urlc
>
>    generic-URL = scheme ":" relativeURL
>
> URLs which are hierarchical in nature use the slash "/" character for
> separating hierarchical components. For some file systems, a "/"
> character (used to denote the hierarchical structure of a URL) is the
> delimiter used to construct a file name hierarchy, and thus the URL
> path will look similar to a file pathname. This does NOT imply that
> the URL is a Unix pathname.

The text in this paragraph should probably say that
hierarchical URLs are a subset of generic-URLs.

> 4.3. URL Syntactic Components
>
> The URL syntax is dependent upon the scheme. Some schemes use
> reserved characters like "?" and ";" to indicate special components,
> while others just consider them to be part of the path. However,
> most URL schemes use a common sequence of four main components to
> define the location of a resource

[this is a preparatory note for a later comment:]
We have four components for a *URL* here.

> To actually be "Uniform" as a resource locator,
> a URL hostname should be a fully qualified domain names. In practice,
> however, the host component may be a local domain literal.

Remove the "s" in "names".

> 4.4. Parsing a URL Reference
>
> A URL reference is typically parsed according to the four main
> components in order to determine what components are present and
> whether or not the reference is relative or absolute.

URLs have four main components. URL references therefore have
five main components.

Regards, Martin.
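
To illustrate the count of components: a minimal sketch in Python,
assuming the standard library's urllib.parse.urlsplit as the parser
(the draft does not prescribe any implementation) and a made-up
example.com reference. The helper name split_reference is hypothetical.
It separates the four URL components from the fragment, which only a
URL reference carries.

```python
from urllib.parse import urlsplit


def split_reference(ref):
    """Split a URL reference into four URL components plus a fragment.

    Hypothetical helper for illustration only; the field names are
    those of urlsplit, not the names used in the draft's grammar.
    """
    parts = urlsplit(ref)
    # The four main components of the URL itself.
    url_components = (parts.scheme, parts.netloc, parts.path, parts.query)
    # The fifth component belongs to the reference, not to the URL.
    return url_components, parts.fragment


url_components, fragment = split_reference("http://www.example.com/a/b?x=1#sec2")
print(url_components)
print(fragment)
```

A relative reference such as "../c#top" goes through the same helper;
its scheme and site components simply come back empty.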
Received on Monday, 12 May 1997 08:37:03 UTC