Re: Multi-server HTTP

On Fri, Oct 2, 2009 at 5:11 AM, Ford, Alan <alan.ford@roke.co.uk> wrote:
> Hi,
>
>> -----Original Message-----
>> From: Anthony Bryan [mailto:anthonybryan@gmail.com]
>> Sent: 25 September 2009 23:17
>> To: Ford, Alan
>> Cc: Henrik Nordstrom; Mark Nottingham; ietf-http-wg@w3.org; Mark Handley
>> Subject: Re: Multi-server HTTP
>>
>> On Thu, Sep 24, 2009 at 8:23 AM, Ford, Alan <alan.ford@roke.co.uk> wrote:
>> > Henrik's point here goes back to the question of mirroring multiple
>> > files / directory structure. This currently remains un-tackled, but I
>> > think one idea that could be used is the use of a Link: header going to
>> > a form of Metalink that can be used for whole mirrors. Something to come
>> > back to, once the basic concept is clear.
>>
>> Right, nothing so far requires XML though. Do we want to add that dependency?
>>
>> I'd like tools like wget & curl to be able to support this (if they
>> choose to) w/o adding extra dependencies unless required.
>>
>> Thoughts?
>
> Well it would only be an optional extension for mirroring whole directory structures, and wouldn't be a mandatory feature. So I don't see it being too much of a problem to require XML parsing for such an extension. However...
>
> The alternative is some kind of wildcarded Link: header with a version=1.2 or similar parameter, which can then be compared against a .version file (or similar) on the mirror - putting the version controlling in the publisher's court (which I think is a reasonable assumption to make).

What about something simple(?) like "depth" (or something better
named), where "depth=0" is the default and means ONLY that file is
mirrored. A value of 1 means that file and everything else in its
directory are mirrored. 2 means the directory above that, and
everything in it, etc.

   Link: <http://www2.example.com/dir/dir2/example.ext>; rel="duplicate"; depth=2

In the above example, /dir/* (including /dir/dir2/*) would be mirrored...
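
To make the counting concrete, here is a rough Python sketch of how a
client might turn a URL plus "depth" into the path it treats as
mirrored. It assumes nothing beyond the definition above (0 = the file
only, 1 = its directory, 2 = the directory above that). The function
name is made up for illustration, and "depth" itself is only a
suggestion, not something in any draft.

```python
import posixpath
from urllib.parse import urlsplit


def mirrored_prefix(url: str, depth: int = 0) -> str:
    """Return the path covered by a hypothetical "depth" parameter.

    depth=0 -> only the file itself
    depth=1 -> the directory containing the file
    depth=2 -> the directory above that, and so on, stopping at "/".
    """
    path = urlsplit(url).path
    if depth <= 0:
        return path
    prefix = path
    for _ in range(depth):
        # Strip one path component per level of depth.
        prefix = posixpath.dirname(prefix)
    return prefix.rstrip("/") + "/"
```

A client would then treat any path starting with the returned prefix
as available on that mirror.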

>> >>    use the Link header [draft-nottingham-http-link-header] to indicate
>> >>    them.  They also MUST provide checksums of files via Instance
>> > Digests
>> >>    in HTTP [RFC3230].  Mirror and checksum information provided by the
>> >>    originating Metalink server MUST be considered authoritative.
>> >>    Metalink servers and their associated mirror servers SHOULD all
>> > share
>> >>    the same ETag policy, i.e. base it on the file contents (checksum)
>> >>    and not server-unique filesystem metadata.
>> >
>> > This must be a MUST, surely, since the whole concept breaks down if we
>> > don't have verification of the same resource on all servers.
>>
>> Ideally, it's a MUST. In the real world, it's hard to get mirrors to
>> coordinate these extra things. I realize there are some
>> inconsistencies in the draft, we want some balance though. Maybe we'll
>> end up having different levels, of preferred (ETag/Instance Digest)
>> mirrors vs regular mirrors.
>>
>> Without them, the early mismatch detection breaks down, yes, but we
>> don't have that in existing Metalink XML clients, and it's not a deal
>> breaker. Of course, it's an added bonus! (Chunk checksums are used,
>> but that's post download). There's always the authoritative whole file
>> checksum obtained from the metalink server in the original request.
>
> Well I guess the best case above is that the client requests an Instance-Digest from every server, and breaks the download if it finds it doesn't match. Not ideal but would work. It's still much nicer to do the test with If-Match - that's what it's there for. I do feel its use should be mandated.

Yes. I agree its use should be mandated, if it is feasible in the real world.

If mandating it means no one will end up using this, then I'd rather
relax the requirement and have it used, because even that is better
than how things are now.
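
For the fallback case (no shared ETags, so the client compares Instance
Digests itself), a minimal Python sketch of the check might look like
the following. It assumes the server returned a Digest header carrying
a base64-encoded MD5 value as described in RFC 3230. The function names
are hypothetical, and treating a missing digest as "not verified" is my
assumption, not something the draft says.

```python
import base64
import hashlib


def parse_digest_header(value: str) -> dict:
    """Split a Digest header value such as 'MD5=...,SHA=...' into a dict."""
    digests = {}
    for item in value.split(","):
        # partition() splits at the first "=", so base64 padding survives.
        algorithm, _, encoded = item.strip().partition("=")
        if encoded:
            digests[algorithm.upper()] = encoded
    return digests


def body_matches_digest(body: bytes, digest_header: str) -> bool:
    """Check a downloaded body against the MD5 instance digest, if present."""
    expected = parse_digest_header(digest_header).get("MD5")
    if expected is None:
        return False  # assumption: a missing digest means "cannot verify"
    actual = base64.b64encode(hashlib.md5(body).digest()).decode("ascii")
    return actual == expected
```

A client would run this against the digest from the original Metalink
server and drop any mirror whose data does not match.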

>> >>    Mirror servers are typically FTP or HTTP servers that "mirror"
>> >>    another server.  That is, they provide identical copies of (at
>> > least
>> >>    some) files that are also on the mirrored server.  Mirror servers
>> > MAY
>> >>    be Metalink servers.  Mirror servers MUST support serving partial
>> >>    content.  Mirror servers SHOULD support Instance Digests in HTTP
>> >>    [RFC3230].
>> >
>> > I'm wondering if this should be a MUST, for the reasons above. Although
>> > by ensuring ETag similarity this can probably stay as a SHOULD, since
>> > that would remain at the same level of reliability as current HTTP.
>>
>> As above, so below. In some cases, it's hard to get mirrors to buy into.
>
> Maybe so, but the danger of permitting this stop-gap solution is that people will stop there!
>
> Anyway, that Instance-Digest is a SHOULD, but the one before is a MUST. This needs deciding one way or the other :)

Ok...

>> > Fixed-size chunks may be preferable here. Indeed, a client could use a
>> > Metalink XML file (which has chunks and checksums defined, and
>> > potentially linked to from the original request) in the event of an
>> > error to detect, to the level of a fixed-size chunk, which part of the
>> > file is broken and just re-fetch that chunk.
>>
>> Yep.
>>
>> For large files, repairing downloads might be more important than
>> early file mismatch detection.
>
> I think so too. Pointing towards some Metalink XML may be the best way forward there - trying to cram sufficient information in HTTP headers is not ideal.

Yep.
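
As a sketch of what that repair step could look like on the client,
assuming it already has the fixed chunk size and a list of per-chunk
SHA-1 checksums (e.g. taken from a Metalink XML file): the helper names
below are invented for illustration, and SHA-1 is just one choice of
hash.

```python
import hashlib


def broken_chunks(data: bytes, chunk_size: int, expected: list[str]) -> list[int]:
    """Return the indexes of fixed-size chunks whose SHA-1 does not match."""
    bad = []
    for index, want in enumerate(expected):
        chunk = data[index * chunk_size:(index + 1) * chunk_size]
        if hashlib.sha1(chunk).hexdigest() != want.lower():
            bad.append(index)
    return bad


def byte_range(index: int, chunk_size: int, total_size: int) -> str:
    """Build the Range header value that re-fetches a single chunk."""
    start = index * chunk_size
    # The last chunk may be shorter than chunk_size.
    end = min(start + chunk_size, total_size) - 1
    return f"bytes={start}-{end}"
```

Only the chunks reported by broken_chunks() need to be requested again,
each with the Range value from byte_range().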

>> > The second issue with initial handshake is what to do regarding sending
>> > data in the initial response.
>> >
>> > In Multi-Server HTTP, we proposed that the client should declare its
>> > capability with a custom header (X-Multiserver-Version), at which point
>> > the server knows to respond with the Mirrors list, and to start
>> > transmitting data immediately but only of a chunk size (Content-Range)
>> > that it is comfortable with. After this, the client will decide what
>> > sizes of chunk it wants from each server and autonomously starts
>> > fetching them.
>>
>> The different sized chunks are really nice!
>>
>> > It's not clear in your draft how you handle this. Reading Section 7 it
>> > sounds as if no data apart from the header is sent, leaving the client
>> > to start requesting Ranges, but this would break non-multilink clients.
>> >
>> > So a possible compromise may be for the client to request a HEAD only,
>> > get all the relevant metadata and then. Not ideal since there's a RTT
>> > delay before the data transfer can start, but that's not a major issue
>> > for big resources, only lots of little requests.
>
> This issue is still unanswered. What were your intentions on this question so far, and what do you think of my solution above?

That was an omission and fixed in a later draft. Data is sent unless
it's a HEAD request.

We can use a custom header to negotiate things, but the nice part, I
think, is that all of this can be sent w/o negotiating, if you want to
do things that way.

>> > There is also no mechanism for a server to specify priorities of mirrors
>> > (yet). This could be very useful, even to the extent of being able to
>> > request a client does not download anything from itself since it is just
>> > a broker, or overloaded.
>>
>> Right now, mirrors are listed in order of priority but more detailed
>> information could be useful.
>
> The prio parameter looks like a fine solution.

Indeed, cheers to Mark Nottingham. I added that to the draft.
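
A rough Python sketch of how a client could order mirrors with such a
parameter follows. Assumptions: the parameter is spelled "prio" as in
this thread, a lower number means a more preferred mirror, and mirrors
without it keep their listed order after the prioritised ones. The
parsing is deliberately naive (it would break on URLs containing ";")
and the function names are hypothetical.

```python
def parse_link(value: str) -> tuple[str, dict]:
    """Split one Link header value into its URL and a dict of parameters."""
    url_part, *param_parts = value.split(";")
    url = url_part.strip().lstrip("<").rstrip(">")
    params = {}
    for part in param_parts:
        name, _, raw = part.strip().partition("=")
        params[name.lower()] = raw.strip('"')
    return url, params


def order_mirrors(link_values: list[str]) -> list[str]:
    """Return rel="duplicate" URLs, most preferred first."""
    ranked = []
    for position, value in enumerate(link_values):
        url, params = parse_link(value)
        if params.get("rel") != "duplicate":
            continue
        try:
            prio = int(params["prio"])
        except (KeyError, ValueError):
            prio = float("inf")  # no usable prio: sort after the rest
        ranked.append((prio, position, url))
    return [url for _, _, url in sorted(ranked)]
```

Sorting on (prio, position) keeps today's "listed in order of priority"
behaviour as the tie-breaker.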

>> > Compromise: a client can request random sized chunks but they don't
>> > necessarily get it with a Content-MD5 header.
>> >
>> > Question is, how does a client know what the "approved" chunk size is?
>>
>> We could go by file size?
>
> Possibly, something like chunks at 10% of file size for any files > 5MB or something. But I can see that threshold needing to change as bandwidth continues to become more plentiful.
>
> What's more, it doesn't take into account load on servers, RTTs, etc.
>
> Hmm. I don't immediately have a solution! Maybe this would again have to be done via Metalink XML that specifies the chunk size.

Ok.
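
Just to illustrate the file-size rule of thumb floated above (chunks at
10% of the file size for files over 5MB), a small Python sketch. The
threshold and fraction are the example figures from this thread, not
agreed values, and it ignores server load and RTTs entirely, as noted.

```python
def chunk_size_for(file_size: int,
                   threshold: int = 5 * 1024 * 1024,
                   divisor: int = 10) -> int:
    """Pick a chunk size: whole file if small, else 1/divisor of its size."""
    if file_size <= threshold:
        return file_size
    return max(1, file_size // divisor)


def chunk_ranges(file_size: int) -> list[tuple[int, int]]:
    """List inclusive (start, end) byte offsets covering the whole file."""
    if file_size <= 0:
        return []
    size = chunk_size_for(file_size)
    return [(start, min(start + size, file_size) - 1)
            for start in range(0, file_size, size)]
```

Each (start, end) pair maps directly onto a "Range: bytes=start-end"
request to one of the mirrors.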

-- 
(( Anthony Bryan ... Metalink [ http://www.metalinker.org ]
  )) Easier, More Reliable, Self Healing Downloads

Received on Friday, 2 October 2009 22:56:01 UTC