RE: Multi-server HTTP from Ford, Alan on 2009-10-02 (ietf-http-wg@w3.org from October to December 2009)

From: Ford, Alan <alan.ford@roke.co.uk>
Date: Fri, 2 Oct 2009 10:11:16 +0100
To: "Anthony Bryan" <anthonybryan@gmail.com>
Cc: <ietf-http-wg@w3.org>, "Mark Handley" <m.handley@cs.ucl.ac.uk>
Message-ID: <2181C5F19DD0254692452BFF3EAF1D6808933FE7@rsys005a.comm.ad.roke.co.uk>
Hi,

> -----Original Message-----
> From: Anthony Bryan [mailto:anthonybryan@gmail.com]
> Sent: 25 September 2009 23:17
> To: Ford, Alan
> Cc: Henrik Nordstrom; Mark Nottingham; ietf-http-wg@w3.org; Mark Handley
> Subject: Re: Multi-server HTTP
> 
> On Thu, Sep 24, 2009 at 8:23 AM, Ford, Alan <alan.ford@roke.co.uk> wrote:
> > Henrik's point here goes back to the question of mirroring multiple
> > files / directory structure. This currently remains un-tackled, but I
> > think one idea that could be used is the use of a Link: header going to
> > a form of Metalink that can be used for whole mirrors. Something to come
> > back to, once the basic concept is clear.
> 
> Right, nothing so far requires XML though. Do we want to add that dependency?
> 
> I'd like tools like wget & curl to be able to support this (if they
> choose to) w/o adding extra dependencies unless required.
> 
> Thoughts?

Well, it would only be an optional extension for mirroring whole directory structures, not a mandatory feature, so I don't see requiring XML parsing for it as much of a problem. However...

The alternative is some kind of wildcarded Link: header with a version=1.2 (or similar) parameter, which can then be compared against a .version file (or similar) on the mirror. That puts version control in the publisher's court, which I think is a reasonable assumption to make.
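
As a rough illustration of the wildcard idea, here is a minimal sketch of the client-side check. It assumes a hypothetical header of the form `Link: <http://mirror.example/pub/*>; rel="duplicate"; version="1.2"` and a mirror-side `.version` file holding only the version string. Neither the `version` parameter nor the `.version` file is defined in any draft, and the function names are made up for the example.

```python
import re

def parse_link_params(header_value):
    """Split one Link header value into (target URI, parameter dict).

    Simplified on purpose: it does not handle ';' inside quoted values.
    """
    match = re.match(r'\s*<([^>]*)>(.*)', header_value)
    if match is None:
        raise ValueError("not a Link header value: %r" % header_value)
    target, rest = match.group(1), match.group(2)
    params = {}
    for part in rest.split(';'):
        if '=' in part:
            name, value = part.split('=', 1)
            params[name.strip().lower()] = value.strip().strip('"')
    return target, params

def mirror_is_current(link_header_value, mirror_version_file_text):
    """True if the mirror's .version content equals the published version.

    Both the 'version' parameter and the .version file are assumptions.
    """
    _, params = parse_link_params(link_header_value)
    published = params.get('version')
    if published is None:
        return False  # publisher gave no version, so nothing can be verified
    return published == mirror_version_file_text.strip()
```

A client would fetch the `.version` file from the mirror once and skip any mirror for which `mirror_is_current` returns False.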

> >>    use the Link header [draft-nottingham-http-link-header] to indicate
> >>    them.  They also MUST provide checksums of files via Instance
> > Digests
> >>    in HTTP [RFC3230].  Mirror and checksum information provided by the
> >>    originating Metalink server MUST be considered authoritative.
> >>    Metalink servers and their associated mirror servers SHOULD all
> > share
> >>    the same ETag policy, i.e. base it on the file contents (checksum)
> >>    and not server-unique filesystem metadata.
> >
> > This must be a MUST, surely, since the whole concept breaks down if we
> > don't have verification of the same resource on all servers.
> 
> Ideally, it's a MUST. In the real world, it's hard to get mirrors to
> coordinate these extra things. I realize there are some
> inconsistencies in the draft, we want some balance though. Maybe we'll
> end up having different levels, of preferred (ETag/Instance Digest)
> mirrors vs regular mirrors.
> 
> Without them, the early mismatch detection breaks down, yes, but we
> don't have that in existing Metalink XML clients, and it's not a deal
> breaker. Of course, it's an added bonus! (Chunk checksums are used,
> but that's post download). There's always the authoritative whole file
> checksum obtained from the metalink server in the original request.

Well, I guess the best case above is that the client requests an Instance Digest from every server and aborts the download if one doesn't match. Not ideal, but it would work. It's still much nicer to do the test with If-Match, which is what it's there for. I do feel its use should be mandated.
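
To make the two checks concrete, here is a minimal sketch of what a client might send and compare. It uses the Want-Digest and Digest headers from RFC 3230 together with If-Match on a Range request. The function names are hypothetical, and the choice of SHA-256 is an assumption for the example, since the draft does not fix an algorithm.

```python
import base64
import hashlib

def digest_header_value(data):
    """Return a Digest header value (RFC 3230 syntax) for a complete instance."""
    raw = hashlib.sha256(data).digest()
    return "SHA-256=" + base64.b64encode(raw).decode("ascii")

def range_request_headers(etag, first_byte, last_byte):
    """Headers for one chunk request that fails fast if the mirror's copy differs."""
    return {
        "Range": "bytes=%d-%d" % (first_byte, last_byte),
        "If-Match": etag,          # mirror answers 412 if its ETag differs
        "Want-Digest": "SHA-256",  # ask the mirror to report its instance digest
    }

def mirror_matches(authoritative_digest, mirror_response_headers):
    """Compare a mirror's Digest header against the authoritative one.

    Simplified: assumes the mirror sends a single digest value.
    """
    reported = mirror_response_headers.get("Digest")
    return reported is not None and reported.strip() == authoritative_digest
```

With If-Match the mismatch is caught by the mirror before any data is sent. With only the digest comparison the client has to notice the mismatch itself and drop the connection.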

> >>    Mirror servers are typically FTP or HTTP servers that "mirror"
> >>    another server.  That is, they provide identical copies of (at
> > least
> >>    some) files that are also on the mirrored server.  Mirror servers
> > MAY
> >>    be Metalink servers.  Mirror servers MUST support serving partial
> >>    content.  Mirror servers SHOULD support Instance Digests in HTTP
> >>    [RFC3230].
> >
> > I'm wondering if this should be a MUST, for the reasons above. Although
> > by ensuring ETag similarity this can probably stay as a SHOULD, since
> > that would remain at the same level of reliability as current HTTP.
> 
> As above, so below. In some cases, it's hard to get mirrors to buy into.

Maybe so, but the danger of permitting this stop-gap solution is that people will stop there!

Anyway, Instance Digest support is a SHOULD in this paragraph, but a MUST in the earlier one. This needs deciding one way or the other :)

> >> There is also some text about Content-MD5 for partial checksums.
> >
> > Content-MD5 seems a reasonable solution to detecting errors in chunks,
> > but I am still concerned about the overhead.
> 
> Me too. Content-MD5 was removed from the latest revision.
>
> > Fixed-size chunks may be preferable here. Indeed, a client could use a
> > Metalink XML file (which has chunks and checksums defined, and
> > potentially linked to from the original request) in the event of an
> > error to detect, to the level of a fixed-size chunk, which part of the
> > file is broken and just re-fetch that chunk.
> 
> Yep.
> 
> For large files, repairing downloads might be more important than
> early file mismatch detection.

I think so too. Pointing towards some Metalink XML may be the best way forward there - trying to cram sufficient information into HTTP headers is not ideal.
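
The repair step could be sketched roughly as follows. It assumes the client already has the fixed chunk size and a list of per-chunk SHA-1 hex digests taken from a Metalink XML file, and that the downloaded data has the full expected length. The function name and the choice of SHA-1 are assumptions for the example.

```python
import hashlib

def broken_chunks(data, chunk_size, expected_hashes):
    """Return inclusive (first_byte, last_byte) ranges whose hash does not match.

    Assumes len(data) equals the expected file size, so only corruption
    (not truncation) is detected.
    """
    ranges = []
    for index, expected in enumerate(expected_hashes):
        start = index * chunk_size
        chunk = data[start:start + chunk_size]
        if hashlib.sha1(chunk).hexdigest() != expected.lower():
            # Inclusive range, usable directly as "Range: bytes=first-last".
            last = min(start + chunk_size, len(data)) - 1
            ranges.append((start, last))
    return ranges
```

Each returned pair maps directly onto one Range request, so only the damaged chunks are fetched again.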

> > I don't understand what you mean here, there's almost no difference
> > between the behaviour of what we wrote in multiserver HTTP and your
> > Metalink Headers.
> >
> > Sure, we specified custom headers, but only because I was unaware that
> > Link: was sufficiently appropriate. It was always the intention (but may
> > not have been clear) that the mirror URLs could be included in a
> > metadata file on the web server edited by the user, and inserted by the
> > server into the headers. The only requirement on coordination between
> > servers is the same checksum and X-If-Checksum-Match: (as it was)
> > behaviour.
> 
> I apologize if I've mis-characterized what you wrote; it's been a
> while since I read it and I've been focused on this.
> 
> Don't all servers have to be running the same multi-server software
> etc? That isn't doable in some cases. But it is really nice and
> preferred!

We only required them to be providing the same checksums. The version header was for the version of multi-server HTTP (not the software itself), and was just for capability detection, as I described below...

> > The second issue with initial handshake is what to do regarding sending
> > data in the initial response.
> >
> > In Multi-Server HTTP, we proposed that the client should declare its
> > capability with a custom header (X-Multiserver-Version), at which point
> > the server knows to respond with the Mirrors list, and to start
> > transmitting data immediately but only of a chunk size (Content-Range)
> > that it is comfortable with. After this, the client will decide what
> > sizes of chunk it wants from each server and autonomously starts
> > fetching them.
> 
> The different sized chunks are really nice!
> 
> > It's not clear in your draft how you handle this. Reading Section 7 it
> > sounds as if no data apart from the header is sent, leaving the client
> > to start requesting Ranges, but this would break non-multilink clients.
> >
> > So a possible compromise may be for the client to request a HEAD only,
> > get all the relevant metadata and then. Not ideal since there's a RTT
> > delay before the data transfer can start, but that's not a major issue
> > for big resources, only lots of little requests.

This issue is still unanswered. What were your intentions on this question so far, and what do you think of my solution above?
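
For the HEAD-first variant, a rough sketch of what the client does once the HEAD response has given it the Content-Length and the list of mirrors. The round-robin assignment and the function name are assumptions for the example. A real client would more likely size chunks per server, as described for Multi-Server HTTP above.

```python
from itertools import cycle

def plan_range_requests(content_length, servers, chunk_size):
    """Assign consecutive byte ranges to servers in round-robin order.

    Returns a list of (server URL, Range header value) pairs.
    """
    if content_length <= 0 or chunk_size <= 0 or not servers:
        return []
    plan = []
    next_server = cycle(servers)
    for start in range(0, content_length, chunk_size):
        last = min(start + chunk_size, content_length) - 1
        plan.append((next(next_server), "bytes=%d-%d" % (start, last)))
    return plan
```

The cost is the extra round trip for the HEAD before the first byte of data arrives.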

> > There is also no mechanism for a server to specify priorities of mirrors
> > (yet). This could be very useful, even to the extent of being able to
> > request a client does not download anything from itself since it is just
> > a broker, or overloaded.
> 
> Right now, mirrors are listed in order of priority but more detailed
> information could be useful.

The prio parameter looks like a fine solution.
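
A minimal sketch of how a client might order mirrors using such a parameter. It assumes the parameter is spelled `prio` on each Link header value and that a lower number means a more preferred mirror. Both points are assumptions for the example, as is the function name.

```python
import re

def mirrors_by_priority(link_header_values):
    """Return mirror URLs sorted by their prio parameter, lowest number first."""
    mirrors = []
    for position, value in enumerate(link_header_values):
        match = re.match(r'\s*<([^>]*)>(.*)', value)
        if match is None:
            continue  # skip anything that is not a Link value
        url, rest = match.groups()
        prio_match = re.search(r';\s*prio\s*=\s*"?(\d+)"?', rest)
        # Mirrors without prio sort last; ties keep the header order.
        prio = int(prio_match.group(1)) if prio_match else float("inf")
        mirrors.append((prio, position, url))
    return [url for _, _, url in sorted(mirrors)]
```

Mirrors with no prio fall back to header order, which matches the current "listed in order of priority" behaviour.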

> >>     * Content-MD5 for chunk checksums could lead to many random size
> >> chunk checksum requests. Use consistent chunk sizes?
> >
> > Random sized chunks would be preferred since they allow a client to load
> > balance according to the speed of each mirror, but the overhead for the
> > servers to generate these arbitrary, non-cacheable checksums is
> > moderately high.
> 
> Exactly.
> 
> > Compromise: a client can request random sized chunks but they don't
> > necessarily get it with a Content-MD5 header.
> >
> > Question is, how does a client know what the "approved" chunk size is?
> 
> We could go by file size?

Possibly: something like chunks of 10% of the file size for any file over 5MB. But I can see that threshold needing to change as bandwidth becomes more plentiful.
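
As a sketch, the file-size rule might be computed like this. The 5MB threshold and the ten-way split come from the suggestion above. The function name, the parameter names, and rounding the chunk size up so that there are never more than ten chunks are assumptions for the example.

```python
def chunk_ranges(file_size, threshold=5 * 1024 * 1024, parts=10):
    """Split a file into inclusive byte ranges: whole file up to the threshold,
    otherwise chunks of roughly 1/parts of the file size."""
    if file_size <= 0:
        return []
    if file_size <= threshold:
        return [(0, file_size - 1)]  # small file: fetch in one piece
    chunk = -(-file_size // parts)   # ceiling division, so at most `parts` chunks
    return [(start, min(start + chunk, file_size) - 1)
            for start in range(0, file_size, chunk)]
```

Making `threshold` and `parts` parameters reflects the point that the numbers will need to change over time.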

What's more, it doesn't take into account load on servers, RTTs, etc.

Hmm. I don't immediately have a solution! Maybe this would again have to be done via Metalink XML that specifies the chunk size.

Regards,
Alan


-- 
Roke Manor Research Ltd, Romsey,
Hampshire, SO51 0ZN, United Kingdom

Received on Friday, 2 October 2009 09:12:13 UTC