Re: Multi-server HTTP from Anthony Bryan on 2009-09-25 (ietf-http-wg@w3.org from July to September 2009)

From: Anthony Bryan <anthonybryan@gmail.com>
Date: Fri, 25 Sep 2009 18:17:04 -0400
To: "Ford, Alan" <alan.ford@roke.co.uk>
Cc: Henrik Nordstrom <henrik@henriknordstrom.net>, Mark Nottingham <mnot@mnot.net>, ietf-http-wg@w3.org, Mark Handley <m.handley@cs.ucl.ac.uk>
Message-ID: <bb9e09ee0909251517l6045ee34qffe1751a5120c538@mail.gmail.com>

On Thu, Sep 24, 2009 at 8:23 AM, Ford, Alan <alan.ford@roke.co.uk> wrote:
> Hi Anthony, all,
>
> I have now had some time to read this latest draft. It seems a nice way
> of linking the multi-server and metalink concepts. I am pleased that
> you've considered our Multi-server HTTP ideas in this and we'd be happy
> to collaborate further to create a standardised way of achieving this
> efficiency across servers.

Excellent!

> I have several comments about the operation, but I'll come to those
> after the points in the below email...

Thank you for taking the time to review it & respond to both drafts.

>> On Fri, Aug 28, 2009 at 11:27 AM, Henrik Nordstrom
>> <henrik@henriknordstrom.net> wrote:
>> > Fri 2009-08-28 at 12:38 +0100, Ford, Alan wrote:
>> >
>> > So I would recommend the following slightly different approach to your
>> > problem.
>> >
>> > * Define a new Mirror profile object, similar to MetaLink but defining
>> > the mirror URL policy for groups of URLs on the server, without going
>> > into checksums etc (HTTP will give those).
>
> Henrik's point here goes back to the question of mirroring multiple
> files / directory structure. This currently remains un-tackled, but I
> think one idea that could be used is the use of a Link: header going to
> a form of Metalink that can be used for whole mirrors. Something to come
> back to, once the basic concept is clear.

Right, nothing so far requires XML though. Do we want to add that dependency?

I'd like tools like wget & curl to be able to support this (if they
choose to) w/o adding extra dependencies unless required.

Thoughts?

>> Henrik, I have added your suggestions about ETags to my draft (
>> http://tools.ietf.org/html/draft-bryan-metalinkhttp ) almost verbatim.
>> I didn't try to reword it, and if this is a problem, let me know.
>> I am looking for interested collaborators and co-authors, and you've
>> provided great insight. Would you like to join us?
>>
>> Here is the current description:
>>
>>    Metalink servers are HTTP servers that MUST have lists of mirrors and
>
> Not sure it makes sense to define everything in this document around
> being a "Metalink server". After all, this is just an extension to HTTP,
> and relatively separate to the existing Metalink work (AIUI).

AIUI existing metalink work is about improving downloads. This has
mostly been done with mirrors and checksums (& more goodies when
possible). I don't see much difference.
That's why it took fewer than 20 lines of code to update an existing
metalink XML client to support this new draft.

A "Metalink server" by any other name would function just as sweet, as
they say. :)

As you can tell from the draft, the naming isn't totally consistent
throughout. We've used "Metalink in HTTP Headers", "MetaLinkHeader",
"Metalink/HTTP" and, as we mention, it is mostly a
collection/coordination of features of HTTP.

Any suggestions? Cool acronyms? :)

>>    use the Link header [draft-nottingham-http-link-header] to indicate
>>    them.  They also MUST provide checksums of files via Instance Digests
>>    in HTTP [RFC3230].  Mirror and checksum information provided by the
>>    originating Metalink server MUST be considered authoritative.
>>    Metalink servers and their associated mirror servers SHOULD all share
>>    the same ETag policy, i.e. base it on the file contents (checksum)
>>    and not server-unique filesystem metadata.
>
> This must be a MUST, surely, since the whole concept breaks down if we
> don't have verification of the same resource on all servers.

Ideally, it's a MUST. In the real world, it's hard to get mirrors to
coordinate these extra things. I realize there are some
inconsistencies in the draft, but we want some balance. Maybe we'll
end up having different levels: preferred (ETag/Instance Digest)
mirrors vs regular mirrors.

Without them, the early mismatch detection breaks down, yes, but we
don't have that in existing Metalink XML clients, and it's not a deal
breaker. Of course, it's an added bonus! (Chunk checksums are used,
but that's post download). There's always the authoritative whole file
checksum obtained from the metalink server in the original request.
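
To illustrate the whole-file check mentioned above, here is a minimal
Python sketch of verifying downloaded bytes against the value of an
RFC 3230 Digest header. It is not taken from either draft: the function
names are hypothetical, and it assumes only the MD5 and SHA tokens with
base64-encoded values need to be handled.

```python
import base64
import hashlib

# Map RFC 3230 digest-algorithm tokens to hashlib names (only two covered here).
ALGORITHMS = {"md5": "md5", "sha": "sha1"}

def parse_digest_header(value):
    """Split a value like 'SHA=...,MD5=...' into {algorithm: raw digest bytes}."""
    digests = {}
    for item in value.split(","):
        # partition() splits at the first '=', so base64 padding is preserved.
        name, sep, encoded = item.strip().partition("=")
        if sep and name.lower() in ALGORITHMS:
            digests[name.lower()] = base64.b64decode(encoded)
    return digests

def matches_digest(data, header_value):
    """True if every recognised digest in the header matches the data."""
    digests = parse_digest_header(header_value)
    if not digests:
        return False  # nothing we know how to verify
    return all(
        hashlib.new(ALGORITHMS[name], data).digest() == expected
        for name, expected in digests.items()
    )
```

A client would call matches_digest() once the last byte has arrived,
whichever mirrors the pieces came from.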


>>    The emitted ETag may be
>>    implemented the same as the Instance Digest for simplicity.
>
> OK so the purpose of this is so that we can use the If-Match: header for
> matching digests. I can't immediately think there'd be any problems with
> this but are there any server implementations that use ETags for any
> other purpose and so this may break their assumptions? I judged from the
> original responses I got that people were generally against new headers
> if at all possible so this is almost certainly the best solution so long
> as it doesn't hinder deployment.
>
>>    Mirror servers are typically FTP or HTTP servers that "mirror"
>>    another server.  That is, they provide identical copies of (at least
>>    some) files that are also on the mirrored server.  Mirror servers MAY
>>    be Metalink servers.  Mirror servers MUST support serving partial
>>    content.  Mirror servers SHOULD support Instance Digests in HTTP
>>    [RFC3230].
>
> I'm wondering if this should be a MUST, for the reasons above. Although
> by ensuring ETag similarity this can probably stay as a SHOULD, since
> that would remain at the same level of reliability as current HTTP.

As above, so below. In some cases, it's hard to get mirrors to buy in.

>>    Metalink clients use the mirrors provided by a Metalink server with
>>    Link header [draft-nottingham-http-link-header].  Metalink clients
>>    MUST support HTTP and MAY support FTP, BitTorrent, or other download
>>    methods.  Metalink clients MUST switch downloads from one mirror to
>>    another if the one mirror becomes unreachable.  Metalink clients are
>>    RECOMMENDED to support multi-source, or parallel, downloads, where
>>    chunks of a file are downloaded from multiple mirrors simultaneously
>>    (and optionally, from Peer-to-Peer sources).  Metalink clients MUST
>>    support Instance Digests in HTTP [RFC3230] by requesting and
>>    verifying checksums.  Metalink clients MAY make use of digital
>>    signatures if they are offered.
>>
>> There is also some text about Content-MD5 for partial checksums.
>
> Content-MD5 seems a reasonable solution to detecting errors in chunks,
> but I am still concerned about the overhead.

Me too. Content-MD5 was removed from the latest revision.

> Fixed-size chunks may be preferable here. Indeed, a client could use a
> Metalink XML file (which has chunks and checksums defined, and
> potentially linked to from the original request) in the event of an
> error to detect, to the level of a fixed-size chunk, which part of the
> file is broken and just re-fetch that chunk.

Yep.

For large files, repairing downloads might be more important than
early file mismatch detection.
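
As a rough illustration of that repair step, the Python sketch below
finds which fixed-size chunks of a finished download fail their
checksum and builds the Range value needed to re-fetch just one of
them. The helper names are hypothetical, and SHA-1 hex digests are an
assumption about how the chunk checksums are supplied.

```python
import hashlib

def find_bad_chunks(data, chunk_size, expected_hashes):
    """Return indexes of fixed-size chunks whose SHA-1 does not match."""
    bad = []
    for index, expected in enumerate(expected_hashes):
        chunk = data[index * chunk_size:(index + 1) * chunk_size]
        if hashlib.sha1(chunk).hexdigest() != expected:
            bad.append(index)
    return bad

def byte_range(index, chunk_size, total_size):
    """HTTP Range header value that re-fetches one chunk (last chunk may be short)."""
    start = index * chunk_size
    end = min(start + chunk_size, total_size) - 1
    return "bytes=%d-%d" % (start, end)
```

Only the chunks returned by find_bad_chunks() need to be requested
again, from any mirror.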

>> I have read draft-ford-http-multi-server and my main comment is that
>> the required coordination of all mirror servers may be difficult or
>> impossible unless you are in control of all servers on the mirror
>> network.
>> I don't see this as possible in the open source mirror networks that I
>> follow, but might be for commercial CDNs? In any case, this
>> coordination is not required in my draft.
>
> I don't understand what you mean here, there's almost no difference
> between the behaviour of what we wrote in multiserver HTTP and your
> Metalink Headers.
>
> Sure, we specified custom headers, but only because I was unaware that
> Link: was sufficiently appropriate. It was always the intention (but may
> not have been clear) that the mirror URLs could be included in a
> metadata file on the web server edited by the user, and inserted by the
> server into the headers. The only requirement on coordination between
> servers is the same checksum and X-If-Checksum-Match: (as it was)
> behaviour.

I apologize if I've mis-characterized what you wrote; it's been a
while since I read it and I've been focused on this.

Don't all servers have to be running the same multi-server software
etc? That isn't doable in some cases. But it is really nice and
preferred!

>> Finally, here are some issues with my own draft:
>>
>>     * Mirror negotiation. Only send a few mirrors, or only send them
>> if Want-Digest is used? Some organizations have many mirrors.
>
> Right, this is part of a more general issue of a server's behaviour on
> initial connection. Want-Digest would be a good clue, but is not a
> guarantee, about the client's intentions. However, if we are certain the
> client wants the Mirror list, and the resource is large enough to
> warrant their use, then I feel all should be sent, unless the server has
> its own priorities regarding balancing the load across the mirrors.

Yes.

> The second issue with initial handshake is what to do regarding sending
> data in the initial response.
>
> In Multi-Server HTTP, we proposed that the client should declare its
> capability with a custom header (X-Multiserver-Version), at which point
> the server knows to respond with the Mirrors list, and to start
> transmitting data immediately but only of a chunk size (Content-Range)
> that it is comfortable with. After this, the client will decide what
> sizes of chunk it wants from each server and autonomously starts
> fetching them.

The different sized chunks are really nice!
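
One way a client might pick those sizes is sketched below in Python:
each mirror gets a contiguous byte range in proportion to its measured
throughput. This is only an assumption about client behaviour, not
something either draft specifies. The function name is hypothetical,
and the speeds are assumed to be positive numbers in any consistent
unit.

```python
def split_by_speed(total_size, speeds):
    """Give each mirror a byte range sized in proportion to its speed.

    speeds: dict of mirror URL -> throughput estimate (must be positive).
    Returns a list of (mirror, first_byte, last_byte), with inclusive ends.
    """
    total_speed = sum(speeds.values())
    mirrors = list(speeds)
    ranges = []
    start = 0
    for position, mirror in enumerate(mirrors):
        if position == len(mirrors) - 1:
            end = total_size  # the last mirror takes whatever remains
        else:
            share = int(total_size * speeds[mirror] / total_speed)
            end = min(start + share, total_size)
        if end > start:  # skip mirrors whose share rounds down to nothing
            ranges.append((mirror, start, end - 1))
        start = end
    return ranges
```

Each tuple maps directly onto a "Range: bytes=first-last" request to
that mirror.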

> It's not clear in your draft how you handle this. Reading Section 7 it
> sounds as if no data apart from the header is sent, leaving the client
> to start requesting Ranges, but this would break non-multilink clients.
>
> So a possible compromise may be for the client to request a HEAD only,
> get all the relevant metadata and then. Not ideal since there's a RTT
> delay before the data transfer can start, but that's not a major issue
> for big resources, only lots of little requests.
>
> There is also no mechanism for a server to specify priorities of mirrors
> (yet). This could be very useful, even to the extent of being able to
> request a client does not download anything from itself since it is just
> a broker, or overloaded.

Right now, mirrors are listed in order of priority but more detailed
information could be useful.
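
For illustration, a client could recover that ordering with something
like the Python sketch below, which pulls mirror URLs out of a Link
header value in the order the server listed them. The relation name
"duplicate" and the sample URLs are assumptions, and the parsing is
deliberately simplistic (it would not cope with "<" inside a parameter
value).

```python
import re

def mirrors_in_order(link_header, rel="duplicate"):
    """Return URLs whose rel matches, in the order the server listed them."""
    mirrors = []
    # Each link-value looks like: <URI>; param=value; param="value"
    for uri, params in re.findall(r"<([^>]*)>([^<]*)", link_header):
        for param in params.split(";"):
            name, _, value = param.strip().rstrip(",").partition("=")
            rel_types = value.strip().strip('"').split()
            if name.strip().lower() == "rel" and rel in rel_types:
                mirrors.append(uri)
                break
    return mirrors

# Hypothetical header value, as a server might send it.
example = (
    '<http://a.example/f.iso>; rel="duplicate", '
    '<ftp://b.example/f.iso>; rel="duplicate", '
    '<http://a.example/f.iso.asc>; rel="describedby"'
)
```

A client that walks this list from the front and moves on when a
mirror is unreachable gets the priority behaviour for free.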

> Is there a recognised extension to Link: that could be used for
> specifying priority? (Mark?)
>
>>     * Some publishers desire stronger hashes than MD5 and SHA-1.
>
> RFC3230 is extensible, new digest algorithms could be added to the IANA
> registry (via RFC) if required.

These were included in the latest revision.

>>     * Content-MD5 for chunk checksums could lead to many random size
>> chunk checksum requests. Use consistent chunk sizes?
>
> Random sized chunks would be preferred since they allow a client to load
> balance according to the speed of each mirror, but the overhead for the
> servers to generate these arbitrary, non-cacheable checksums is
> moderately high.

Exactly.

> Compromise: a client can request random sized chunks but they don't
> necessarily get it with a Content-MD5 header.
>
> Question is, how does a client know what the "approved" chunk size is?

We could go by file size?

>>     * Do we want a way to show that whole directories are mirrored,
>> instead of individual files?
>
> I would still like to see this, as I wrote earlier. But however we do
> it, it will require significant metadata if we want to ensure each file
> being mirrored is the same as what the original thinks it is.
>
> Unless, we just have versioning of the directory structure? E.g. a
> publisher puts a ".version" file in a directory, entirely arbitrary, and
> we verify that value is the same on the mirrors, not using a checksum at
> all. Not perfect, but a reasonable compromise?

I really don't know what the best solution is. We don't have any
experience with this issue so far.

> Finally, do you intend that this draft should contain recommendations of
> what to do at the client end - e.g. behaviour on checksum failure? While
> I can see this is not normative for a document such as this, some
> guidelines may be useful.

There is a short "Error Recovery" section that will be corrected and expanded.

-- 
(( Anthony Bryan ... Metalink [ http://www.metalinker.org ]
  )) Easier, More Reliable, Self Healing Downloads
Received on Friday, 25 September 2009 22:17:45 UTC