- From: Ford, Alan <alan.ford@roke.co.uk>
- Date: Thu, 24 Sep 2009 13:23:40 +0100
- To: "Anthony Bryan" <anthonybryan@gmail.com>, "Henrik Nordstrom" <henrik@henriknordstrom.net>
- Cc: "Mark Nottingham" <mnot@mnot.net>, <ietf-http-wg@w3.org>, "Mark Handley" <m.handley@cs.ucl.ac.uk>
Hi Anthony, all,

I have now had some time to read this latest draft. It seems a nice way of
linking the multi-server and Metalink concepts. I am pleased that you've
considered our Multi-server HTTP ideas in this, and we'd be happy to
collaborate further to create a standardised way of achieving this
efficiency across servers.

I have several comments about the operation, but I'll come to those after
the points in the email below...

> On Fri, Aug 28, 2009 at 11:27 AM, Henrik Nordstrom
> <henrik@henriknordstrom.net> wrote:
> > fre 2009-08-28 klockan 12:38 +0100 skrev Ford, Alan:
> >
> > So I would recommend the following slightly different approach to your
> > problem.
> >
> > * Define a new Mirror profile object, similar to MetaLink but defining
> > the mirror URL policy for groups of URLs on the server, without going
> > into checksums etc (HTTP will give those).

Henrik's point here goes back to the question of mirroring multiple files /
directory structures. This currently remains untackled, but one idea that
could be used is a Link: header pointing to a form of Metalink that covers
whole mirrors. Something to come back to once the basic concept is clear.

> Henrik, I have added your suggestions about ETags to my draft (
> http://tools.ietf.org/html/draft-bryan-metalinkhttp ) almost verbatim.
> I didn't try to reword it, and if this is a problem, let me know.
> I am looking for interested collaborators and co-authors, and you've
> provided great insight. Would you like to join us?
>
> Here is the current description:
>
> Metalink servers are HTTP servers that MUST have lists of mirrors and

I am not sure it makes sense to define everything in this document around
being a "Metalink server". After all, this is just an extension to HTTP,
and relatively separate from the existing Metalink work (AIUI).

> use the Link header [draft-nottingham-http-link-header] to indicate
> them. They also MUST provide checksums of files via Instance Digests
> in HTTP [RFC3230]. Mirror and checksum information provided by the
> originating Metalink server MUST be considered authoritative.
> Metalink servers and their associated mirror servers SHOULD all share
> the same ETag policy, i.e. base it on the file contents (checksum)
> and not server-unique filesystem metadata.

Surely this must be a MUST, since the whole concept breaks down if we
cannot verify that it is the same resource on all servers.

> The emitted ETag may be
> implemented the same as the Instance Digest for simplicity.

OK, so the purpose of this is that we can use the If-Match: header for
matching digests. I can't immediately see any problems with this, but are
there any server implementations that use ETags for some other purpose,
whose assumptions this might break? I judged from the original responses I
got that people were generally against new headers if at all possible, so
this is almost certainly the best solution, as long as it doesn't hinder
deployment.

> Mirror servers are typically FTP or HTTP servers that "mirror"
> another server. That is, they provide identical copies of (at least
> some) files that are also on the mirrored server. Mirror servers MAY
> be Metalink servers. Mirror servers MUST support serving partial
> content. Mirror servers SHOULD support Instance Digests in HTTP
> [RFC3230].

I'm wondering if this should be a MUST, for the reasons above. Although,
by ensuring ETag consistency, this can probably stay as a SHOULD, since
that would remain at the same level of reliability as current HTTP.
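Just to make the client behaviour I am assuming above concrete, here is a
very rough sketch in Python (using the requests library). Everything in it,
the URLs, the "duplicate" rel value and the choice of SHA as the digest
algorithm, is an illustrative assumption on my part rather than anything
the draft text above pins down:

    # Rough sketch only: fetch from the originating server, note the mirror
    # list from the Link: header, and verify the RFC 3230 Instance Digest.
    # The rel value "duplicate" and all URLs here are illustrative guesses.
    import base64
    import hashlib
    import requests

    ORIGIN = "http://example.com/pub/file.iso"  # hypothetical Metalink server

    resp = requests.get(ORIGIN, headers={"Want-Digest": "SHA"})

    # Naive Link: header parse, good enough for a sketch; a real client
    # would use a proper parser.
    link = resp.headers.get("Link", "")
    mirrors = [part.split(">")[0].lstrip(" <")
               for part in link.split(",") if "duplicate" in part]

    # RFC 3230 Digest header, e.g. "SHA=thvDyvhfIqlvFe+A9MYgxAfm1q5="
    digest = resp.headers.get("Digest", "")
    expected = dict(d.strip().split("=", 1) for d in digest.split(",") if "=" in d)
    actual = base64.b64encode(hashlib.sha1(resp.content).digest()).decode()

    if expected.get("SHA") not in (None, actual):
        # Checksum failure: re-fetch, preferably from one of the mirrors.
        print("digest mismatch; retry via", mirrors[0] if mirrors else "origin")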
> Metalink clients use the mirrors provided by a Metalink server with
> Link header [draft-nottingham-http-link-header]. Metalink clients
> MUST support HTTP and MAY support FTP, BitTorrent, or other download
> methods. Metalink clients MUST switch downloads from one mirror to
> another if the one mirror becomes unreachable. Metalink clients are
> RECOMMENDED to support multi-source, or parallel, downloads, where
> chunks of a file are downloaded from multiple mirrors simultaneously
> (and optionally, from Peer-to-Peer sources). Metalink clients MUST
> support Instance Digests in HTTP [RFC3230] by requesting and
> verifying checksums. Metalink clients MAY make use of digital
> signatures if they are offered.
>
> There is also some text about Content-MD5 for partial checksums.

Content-MD5 seems a reasonable solution for detecting errors in chunks,
but I am still concerned about the overhead. Fixed-size chunks may be
preferable here. Indeed, in the event of an error a client could use a
Metalink XML file (which has chunks and checksums defined, and could be
linked to from the original request) to detect, to the level of a
fixed-size chunk, which part of the file is broken, and just re-fetch
that chunk.

> I have read draft-ford-http-multi-server and my main comment is that
> the required coordination of all mirror servers may be difficult or
> impossible unless you are in control of all servers on the mirror
> network.
> I don't see this as possible in the open source mirror networks that I
> follow, but might be for commercial CDNs? In any case, this
> coordination is not required in my draft.

I don't understand what you mean here; there's almost no difference
between the behaviour we described in Multi-Server HTTP and your Metalink
headers. Sure, we specified custom headers, but only because I was unaware
that Link: was sufficiently appropriate. It was always the intention
(though it may not have been clear) that the mirror URLs could be included
in a metadata file on the web server, edited by the user and inserted by
the server into the headers. The only coordination required between
servers is the same checksum and X-If-Checksum-Match: (as it was)
behaviour.

> Finally, here are some issues with my own draft:
>
> * Mirror negotiation. Only send a few mirrors, or only send them
> if Want-Digest is used? Some organizations have many mirrors.

Right, this is part of a more general issue of a server's behaviour on the
initial connection. Want-Digest would be a good clue to, but not a
guarantee of, the client's intentions. However, if we are certain the
client wants the mirror list, and the resource is large enough to warrant
its use, then I feel all the mirrors should be sent, unless the server has
its own priorities regarding balancing load across them.

The second issue with the initial handshake is what to do about sending
data in the initial response. In Multi-Server HTTP, we proposed that the
client should declare its capability with a custom header
(X-Multiserver-Version), at which point the server knows to respond with
the mirror list and to start transmitting data immediately, but only a
chunk (indicated by Content-Range) of a size it is comfortable with. After
this, the client decides what sizes of chunk it wants from each server and
autonomously starts fetching them. It's not clear in your draft how you
handle this. Reading Section 7, it sounds as if no data apart from the
header is sent, leaving the client to start requesting Ranges, but this
would break non-Metalink clients.
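To make sure we are talking about the same flow, here is roughly what I
have in mind from the client side, sketched in Python with requests. The
X-Multiserver-Version header is the old custom header from our draft, and
the URLs and chunk size are arbitrary; none of this is meant to be
normative, it is only a sketch under those assumptions:

    # Sketch of the Multi-Server HTTP style handshake described above,
    # from the client side. Header names, URLs and chunk size are
    # illustrative only.
    import base64
    import hashlib
    import requests

    ORIGIN = "http://example.com/pub/file.iso"  # hypothetical origin server
    CHUNK = 1024 * 1024                         # client's chosen chunk size

    # 1. Declare capability; a capable server (hypothetically) answers 206
    #    with an initial chunk of a size of its choosing, the total length
    #    in Content-Range, and the mirror list in Link:.
    first = requests.get(ORIGIN, headers={"X-Multiserver-Version": "1"})
    total = int(first.headers["Content-Range"].rsplit("/", 1)[1])
    link = first.headers.get("Link", "")
    mirrors = [p.split(">")[0].lstrip(" <") for p in link.split(",") if "<" in p]

    parts = {0: first.content}
    start = len(first.content)

    # 2. Fetch the remaining chunks from the mirrors, checking each chunk's
    #    Content-MD5 (base64 MD5 of the partial body) where one is sent.
    for i, offset in enumerate(range(start, total, CHUNK)):
        url = mirrors[i % len(mirrors)] if mirrors else ORIGIN
        end = min(offset + CHUNK, total) - 1
        r = requests.get(url, headers={"Range": f"bytes={offset}-{end}"})
        sent = r.headers.get("Content-MD5")
        calc = base64.b64encode(hashlib.md5(r.content).digest()).decode()
        if sent and sent != calc:
            continue  # broken chunk: a real client would retry elsewhere
        parts[offset] = r.content

    data = b"".join(parts[o] for o in sorted(parts))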
So a possible compromise may be for the client to request HEAD only, get
all the relevant metadata, and then start requesting Ranges. Not ideal,
since there is an RTT of delay before the data transfer can start, but
that is not a major issue for big resources, only for lots of little
requests.

There is also no mechanism for a server to specify priorities of mirrors
(yet). This could be very useful, even to the extent of a server being
able to request that a client does not download anything from the server
itself, because it is just a broker, or is overloaded. Is there a
recognised extension to Link: that could be used for specifying priority?
(Mark?) A rough sketch of what I have in mind is at the end of this mail.

> * Some publishers desire stronger hashes than MD5 and SHA-1.

RFC 3230 is extensible; new digest algorithms could be added to the IANA
registry (via an RFC) if required.

> * Content-MD5 for chunk checksums could lead to many random size
> chunk checksum requests. Use consistent chunk sizes?

Random-sized chunks would be preferred, since they allow a client to
load-balance according to the speed of each mirror, but the overhead for
servers of generating these arbitrary, non-cacheable checksums is
moderately high. A compromise: a client can request random-sized chunks,
but does not necessarily get them with a Content-MD5 header. The question
is, how does a client know what the "approved" chunk size is?

> * Do we want a way to show that whole directories are mirrored,
> instead of individual files?

I would still like to see this, as I wrote earlier. However we do it,
though, it will require significant metadata if we want to ensure that
each file being mirrored is the same as what the original thinks it is.
Unless we just have versioning of the directory structure? E.g. a
publisher puts a ".version" file, with an entirely arbitrary value, in a
directory, and we verify that the value is the same on the mirrors,
without using a checksum at all. Not perfect, but a reasonable compromise?

Finally, do you intend that this draft should contain recommendations of
what to do at the client end, e.g. behaviour on checksum failure? While I
can see this would not be normative for a document such as this, some
guidelines may be useful.
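For what it's worth, this is the kind of thing I had in mind for mirror
priority. The "pri" parameter on Link: below is entirely made up; it is
only meant to illustrate how a client might order mirrors if such an
extension existed:

    # Hypothetical only: no priority parameter on Link: is defined yet.
    # Imagine a server sent something like
    #   Link: <http://mirror1.example.org/file.iso>; rel="duplicate"; pri=1,
    #         <http://mirror2.example.org/file.iso>; rel="duplicate"; pri=5
    # then a client could order its candidate mirrors lowest-value first,
    # and a server that is just a broker, or overloaded, could give itself
    # the worst priority to steer clients towards the mirrors.
    import re

    def mirrors_by_priority(link_header: str) -> list[str]:
        ranked = []
        for part in link_header.split(","):  # naive parse, sketch only
            url = re.search(r"<([^>]+)>", part)
            if not url or "duplicate" not in part:
                continue
            pri = re.search(r"pri=(\d+)", part)
            ranked.append((int(pri.group(1)) if pri else 999, url.group(1)))
        return [u for _, u in sorted(ranked)]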
Regards,
Alan

--
Roke Manor Research Ltd, Romsey, Hampshire, SO51 0ZN, United Kingdom
A Siemens company

Received on Thursday, 24 September 2009 12:24:56 UTC