Re: Byte ranges -- formal spec proposal

At 2:55 PM 5/18/95, Brian Behlendorf wrote:
>On Thu, 18 May 1995, Chuck Shotton wrote:
>However, if, as I proposed, the client takes everything after the ; and
>makes them HTTP request headers, the server never sees ";byterange=blah" in
>the GET line of the HTTP request.

This requires changes to WWW clients AND servers. The current proposal
leaves clients unaffected.

>> I agree. The first byte in a file is byte number 1. If I want the second
>> thru fourth bytes of a file, I want to specify the range 2-4, not 1-3. The
>> latter is hardly intuitive.
>But if people aren't constructing these ranges by hand, what difference
>does it make whether the number maps to the mnemonic that "the 1st object
>should be called object number 1"?

Well, what is gained by calling the first byte in a file "0", or the 50th
byte in a file "49"? This is a tiny nit, but the world isn't a C array
variable. It's a heck of a lot easier to look at 20-30 (when debugging,
developing, or anything else) and realize it means bytes twenty through
thirty, not twenty-one through thirty-one.

>> I still maintain that "?" is the appropriate separator for byte range
>> syntax.
>Uh, how then do I use CGI QUERY_STRING variables along with byterange?

Details, details. :) But how does a VMS server know that ";22" means byte
22 and not version 22 of the file? How about we revisit the "#" proposal
and investigate whether or not it really hoses clients, as I first
suspected?

>> You are asking the server to search for a particular range of bytes
>> in a file, which is consistent with searching for keywords in a file, or
>> coordinates on a map. As I mentioned earlier, semicolon has a specific
>> meaning in many file systems that will conflict with its use as a separator
>> for byte range info. The use of ";" is a bad choice.
>But isn't "?" a perfectly valid Mac filename character too?

Sure, but it SHOULD be encoded as part of the file name with %xx encodings,
whereas the ? used as a search arg separator isn't encoded. If you parse
URLs PRIOR TO %xx decoding, special chars retain their meaning. I guess
this is an argument for VMS servers (and HTML authors) to encode ";" in
file names.

>Ugh, sometimes this whole system just makes you want to scream.

Unfortunately, what we are really doing is taking away the opacity of URLs.
This was a big discussion point between myself and Adobe when we originally
hashed out some of the problems with their first CGI-based byte range
scheme. Ideally, there should be no need for a "standard" byte range
syntax, because URLs are either entered by document authors or generated by
server side applications. Since all URLs currently originate from the
server side (unless hand-entered by a user), there hasn't been a need to
standardize the path portion of URLs.

The reason this is an issue (though Ari hasn't explicitly said it) is that
suddenly, there is a need for client side helpers to be able to generate
and request "non-standard" URLs in a cross platform way. Specifically,
Acrobat needs a cross-platform way to request byte ranges. Suddenly, the
path portions of URLs aren't coming from the server (either being generated
or contained in HTML docs). Instead, smart things on the client side are
trying to invade the turf of the server's interpretation of URLs and make
them up themselves. This eliminates the latitude that servers have enjoyed
in keeping URL information private as far as semantic interpretation is
concerned.

I proposed an alternate scheme, where servers TELL the clients how to
request byte ranges, so that the client may do it in a server specific way
without having to have knowledge of the specific server's syntax for byte
ranges. See the "*" section below for more details. It works fine for
specific viewer apps, but needs more work to be a general solution.

This is a paradigm shift that shouldn't be allowed to pass without careful
scrutiny of the implications. I agree that in this specific case, a common
byte range syntax would be nice. It would be extra nice if servers all
supported it. The problem is that we are paying for this niceness by
forcing servers to give up total control over the interpretation of the
path portion of a URL. This could drastically complicate the URL standard
and the coordination required between client and server apps.

I think we should view this particular proposal as a single-case solution
and avoid trying to generalize things for a bit. It would actually be
better if this was a vendor proposed extension that could optionally be
supported by server authors, rather than trying to shoehorn it into the
existing standards. As I said, it appears to perturb the opaque URL
assumptions that the Web is based on. In specific cases such as the Acrobat
example, the impact is minimal and the benefits are large. However,
allowing clients to continue to drive server behavior in regard to URL
interpretation is a slippery path.

>> Making a new HTTP header means that it will never gain support. Allowing it
>> to be part of the URL (where it belongs in my opinion) means that it can be
>> retrofitted into existing servers with the addition of a CGI. And as for #,
>> ? is a better choice than that or ";".
>Ask the URI working group where they would rather see this functionality
>implemented, and they'll probably say HTTP.  Why won't new HTTP headers
>get support?

Because existing clients will have to be modified, distributed, and in use
before byte ranges will work universally. If the range is tagged onto the
URL, existing clients will work without modification.

The place to fix this is in the server, where you can fix it once, rather
than having to battle with upgrading the entire installed base of WWW
clients. If a server doesn't support byte ranges, it's a safe bet that it
won't be serving URLs pointing back to itself that specify byte ranges. If,
on the other hand, byte ranges are implemented as HTTP request header
fields, and a client doesn't support them, a server that generates URLs
with byte ranges won't be able to operate with a client that doesn't
understand them.

* Alternate proposal for byte range URL generation:
Originally, Adobe proposed a CGI-based syntax for retrieving byte ranges:
the URL passed a numeric range to a CGI, which would read the bytes and
return them. The syntax of the URL that Acrobat generated assumed all CGIs
live in /cgi-bin and have path arguments separated from the URL by a "/".
This obviously breaks on many non-Unix servers.

As an alternative, I suggested that Adobe develop a CGI that, when called
from Acrobat with no arguments, returns the server's preferred syntax
(e.g., a C sprintf format statement) for specifying the URL to the CGI and
the byte range arguments. The client/viewer (Acrobat) could then use this
syntax in subsequent URL requests (sent through the WWW client) to request
byte ranges.

Of course, this scheme implies an intelligent viewer like Acrobat, which is
simply using the HTTP server as a convenient way to get random access to a
distributed file system. This doesn't handle the general-purpose case for
all documents that have ranges of bytes in them that *could* be viewed by a
dumb WWW client.

However, it does show that there are alternate methods to solving this
problem besides hacking URL syntax or HTTP header contents. And, it is
possible to retain a server's control over the interpretation of the URL
paths sent to it. I would really like to encourage everyone to spend some
time considering these proposals in detail before we rush off to add some
more duct tape and baling wire to the existing standards. If we can figure
out a way to do this with some standard CGI behavior, the entire
HTTP/HTML/URI
standards process is left unmolested and we will have probably done the
right thing. There doesn't appear to be any compelling reason why this byte
range thing has to be implemented as a change to these existing standards
instead of some private, CGI-based implementation.

Chuck Shotton                                                   "I am NOT here."

Received on Thursday, 18 May 1995 15:59:10 UTC