- From: Alex Russell <slightlyoff@google.com>
- Date: Sat, 1 Feb 2014 18:26:55 -0800
- To: Herbert Van de Sompel <hvdsomp@gmail.com>
- Cc: Jeni Tennison <jeni@jenitennison.com>, "www-tag@w3.org" <www-tag@w3.org>
- Message-ID: <CANr5HFWQkOwL7iLuS7Rji0RDnOgCOx7fbLgddvfow5nj=1ZzEw@mail.gmail.com>
On Fri, Jan 31, 2014 at 2:22 PM, Herbert Van de Sompel <hvdsomp@gmail.com> wrote:

> On Jan 31, 2014, at 14:50, Alex Russell <slightlyoff@google.com> wrote:
>
> On Tue, Jan 28, 2014 at 4:50 AM, Herbert van de Sompel <hvdsomp@gmail.com> wrote:
>
>> Dear Jeni,
>>
>> I read this with quite some interest in relation to the OAI-ORE specification [1], an RDF-based approach for identifying and describing aggregations of web resources. In the Research Object work [2,3], the need was expressed to be able to package the description of an aggregation of web resources along with representations of the aggregated resources. The approach you describe looks rather attractive in this regard. The "root" document in the case of OAI-ORE would be an RDF serialization.
>>
>> The following questions come up:
>>
>> (a) With different serializations of the root RDF document being available, how does a client request a preferred one? From the "Accept: multipart/package,text/json;q=0.9" example you provide, it seems the proposed approach does not provide that functionality.
>> - Is my reading correct?
>> - Could different representations of the root RDF document be included in the same package?
>
> These packages are about bits, "files" if you will, with (potentially differing) URLs and mime types. I think you've mistaken them for something higher level.
>
> I think I understand it's about bits. I am looking for an approach that allows requesting a root document in a specific serialization, eg RDF/XML, JSON-LD, ... I think Jeni's write-up suggests that's not covered.

That's my reading as well, but I don't view it as a bug. What's a "root document" for a package of JSON segments of a data model? What's a "root document" for an app?

> In which case I was wondering whether all available serializations of a root document could be included.

I see. One assumes so, but using different sub-urls. This doesn't strike me as any sort of real burden...or am I missing something?
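To make "different sub-urls" concrete: nothing in the draft seems to prevent a package from carrying each serialization as its own part under a distinct path. A purely illustrative sketch (the paths, media types and boundary are invented, not taken from the draft):

    Content-Type: multipart/package;boundary=pkg-boundary

    --pkg-boundary
    Content-Location: /dataset/description.rdf
    Content-Type: application/rdf+xml

    ... RDF/XML serialization of the root document ...
    --pkg-boundary
    Content-Location: /dataset/description.jsonld
    Content-Type: application/ld+json

    ... JSON-LD serialization of the root document ...
    --pkg-boundary--

A client wanting a particular serialization would then pick the part whose `Content-Location` (or `Content-Type`) matches, rather than negotiating with the server.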
> That would allow a client to select the desired serialization of the root document from the package instead of requesting it from the server.
>
> Cheers
>
> Herbert
>
>> (b) More generally, how does the proposed approach work with the 303 style approaches of Cool URIs for the Semantic Web [4], i.e. how to express interest in a package with an RDF root document versus an HTML root document?
>>
>> (c) The constraint to limit `Content-Location`s for the files in a package to the domain that the package is hosted on is understandable. Still, for example in the Research Object case, there can be a need to include a specific representation of a "remote" aggregated resource, e.g. a specific version of an evolving resource that was used to obtain a research result. In this case, does the constraint imply that the author of the aggregation needs to:
>> - Collect that specific representation (this would have to be done whichever way)
>> - Host that specific representation at a URI under its control
>> - Include that specific representation in the package with that URI as Content-Location
>> - Provide a link to the remote URI in the header fields for that specific representation with an appropriate relation type, e.g. Memento's "original"
>>
>> Greetings
>>
>> Herbert
>>
>> [1] http://www.openarchives.org/ore/1.0/datamodel
>> [2] http://www.researchobject.org/
>> [3] http://www.w3.org/community/rosc/
>> [4] http://www.w3.org/TR/cooluris/
>>
>> On Tue, Jan 21, 2014 at 5:09 AM, Jeni Tennison <jeni@jenitennison.com> wrote:
>> > Hi,
>> >
>> > I took on the task of documenting the outcomes of our discussions around packaging at the last F2F [1] and the one before that [2].
>> >
>> > I have written a very draft something at
>> >
>> > https://github.com/w3ctag/packaging-on-the-web
>> >
>> > which suggests a way forward that doesn't involve changing the way in which URLs are parsed. Marcos has already added two issues against it (thanks Marcos!). Other thoughts also welcome!
>> >
>> > I have copied it below for those people who don't like following links from mails and to aid discussion here.
>> >
>> > Cheers,
>> >
>> > Jeni
>> >
>> > [1] http://www.w3.org/2001/tag/2014/01/08-minutes.html#item07
>> > [2] http://www.w3.org/2001/tag/2013/10/01-minutes.html#item06
>> >
>> > ---
>> >
>> > # Packaging on the Web
>> >
>> > This document describes an approach for creating packages of files for use on the web. The approach is to package them using a new `multipart/package` media type and a `+package` structured syntax suffix. To access packages related to other files on the web, clients that understand packages of files look for a `Link` header or (in HTML documents) a `<link>` element with a new link relation of `package`. Other formats may define format-specific mechanisms for locating related packages.
>> >
>> > **This is an unreviewed draft by Jeni Tennison and does not represent official TAG or W3C opinion.**
>> >
>> > ## Requirements
>> >
>> > There are two main requirements for packages on the web:
>> >
>> > * efficient delivery of related content on the web
>> > * easy distribution of self-contained content
>> >
>> > There is also a final cross-cutting requirement: that the solution can be easily and backwards-compatibly deployed on the web.
>> >
>> > ### Efficient Delivery
>> >
>> > If a user visits `http://www.bbc.co.uk/` they will need to download about 160 files to view the page in its entirety. The HTML page they download at `http://www.bbc.co.uk/` contains references to stylesheets, scripts, images and other files, each of which may contain references to further files themselves.
>> >
>> > Delivering a package of these files could be more efficient than delivering individual files. Downloading each file has a connection overhead which is particularly impactful on low-bandwidth mobile devices and on secure connections.
>> >
>> > The browser can't work out which additional files to download until it receives the HTML page, but the server could plausibly deliver a package that contains all the required files through a single connection. This would have to work in a backwards-compatible way both for older browsers interacting with package-aware servers, and for package-aware clients working with older servers.
>> >
>> >> *Note: Efficient delivery is the aim of [pipelining in HTTP 1.1](http://www.w3.org/Protocols/rfc2616/rfc2616-sec8.html) (and [HTTPbis](http://tools.ietf.org/html/draft-ietf-httpbis-p1-messaging-25#section-6.3.2)) and [multiplexing in HTTP 2.0](http://tools.ietf.org/html/draft-ietf-httpbis-http2-09#section-2.2).
>> >> These enable multiple requests to be passed, and responded to, over the same persistent connection. But they do not enable the server to deliver content for predicted requests.*
>> >
>> >> *Note: The `rel=prefetch` link relation prompts the browser to access additional HTML pages that have not yet been requested by the user, but again this is a different facility; there is no such thing as prefetching stylesheets, scripts or images as these are already required for the page by the time the browser knows about them.*
>> >
>> >> *Note: given that browsers can typically have more than one connection open to a website, and download files in parallel, is there an argument for supporting having multiple packages associated with a given page?*
>> >
>> > ### Self-Contained Content
>> >
>> > It is sometimes useful to distribute self-contained packages of material. Examples are [Packaged Web Apps](http://www.w3.org/TR/widgets/) or packages of CSV files used within the [Simple Data Format](http://dataprotocols.org/simple-data-format/) or the [DataSet Publishing Language](https://developers.google.com/public-data/). These packages typically contain a manifest file, in a machine-readable format (text, JSON or XML), that contains further details and metadata about the files that the package contains.
>> >
>> > ## Packaging Format
>> >
>> >> *Note: For rationale for selecting this above other packaging formats, see 'Rejected Approaches' below.*
>> >
>> > The TAG recommends using [multipart media types](http://tools.ietf.org/html/rfc2046#section-5.1) for packaging materials on the web. Specifically we recommend the registration of a new `multipart/package` media type and the registration of a new `+package` structured syntax suffix, per [RFC 6838](http://tools.ietf.org/html/rfc6838#section-6).
>> >
>> > Multipart media types are defined in [RFC 2046](http://tools.ietf.org/html/rfc2046#section-5.1). They can contain one or more *body parts*, each comprising:
>> >
>> > * a *boundary*
>> > * a number of *header fields*
>> > * an empty line
>> > * the *content* of the file
>> >
>> >> *Ed note: From what I can tell it's possible for the header fields within a multipart response to be anything so long as they follow the normal `Header: Value` syntax used in MIME and HTTP. So this could be a very flexible mechanism for adding metadata.*
>> >
>> > The `multipart/mixed` media type places no constraints on which header fields are specified within the multipart file (see the definition of `MIME-part-headers` in [RFC 2045](http://tools.ietf.org/html/rfc2045#section-3)).
>> >
>> > The only difference between `multipart/mixed` and `multipart/package` is that the header fields in every body part must include the `Content-Location` header. Further, the values of the `Content-Location` header must all be [absolute-path-relative URLs](http://url.spec.whatwg.org/#concept-absolute-path-relative-url) or [path-relative URLs](http://url.spec.whatwg.org/#concept-path-relative-url). (See Security Considerations later.)
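Putting the pieces above together, a `multipart/package` body might look roughly like this — the boundary string, file names and contents are invented for illustration, and only the header fields discussed in the draft are shown:

    Content-Type: multipart/package;boundary=package-boundary

    --package-boundary
    Content-Location: /home.html
    Content-Type: text/html

    <!doctype html>
    <html>...</html>
    --package-boundary
    Content-Location: /styles/site.css
    Content-Type: text/css

    body { margin: 0 }
    --package-boundary--

Each body part starts at a boundary line, carries its own `Content-Location` (an absolute-path-relative URL, as required above) and `Content-Type`, and is separated from its content by an empty line; the final boundary is terminated with `--`.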
>> > ### Fragment Identifiers for Packages
>> >
>> > A basic fragment identifier scheme for the `multipart/package` media type is of the form `file=url`, where *url* is resolved against the base URL of the package, the part before the fragment in *url* is used to identify a file within the package using the `Content-Location` part header, and any fragment in *url* is used to identify a fragment within that file according to the media type for that file (as given in the `Content-Type` header).
>> >
>> > For example, the URL
>> >
>> >     http://example.org/path/to/package.pack#file=/home.html%23section1
>> >
>> > refers to the file within `http://example.org/path/to/package.pack` whose `Content-Location` is
>> >
>> >     http://example.org/home.html
>> >
>> > and more specifically the element with the `id` `section1` within that file. This should be the same as `http://example.org/home.html#section1`.
>> >
>> > In general, links should be made directly to files on the web rather than to files within packages. The particular package(s) that a file appears in is an ephemeral phenomenon and not suitable for inclusion in a URL.
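For what it's worth, the fragment scheme above seems simple enough for a client to interpret with a few lines of code. A sketch (TypeScript; the function name and result shape are invented, and this follows only the prose above, not any agreed algorithm):

    interface PackageFragmentTarget {
      contentLocation: string;       // absolute URL to match against part Content-Location headers
      innerFragment: string | null;  // fragment to apply within the selected file
    }

    // `fragment` is the part after `#` in the package URL, e.g. "file=/home.html%23section1".
    function resolvePackageFragment(packageUrl: string, fragment: string): PackageFragmentTarget | null {
      if (!fragment.startsWith("file=")) return null;
      const value = decodeURIComponent(fragment.slice("file=".length));
      const hashIndex = value.indexOf("#");
      const filePart = hashIndex === -1 ? value : value.slice(0, hashIndex);
      const innerFragment = hashIndex === -1 ? null : value.slice(hashIndex + 1);
      // Resolve against the base URL of the package, per the scheme described above.
      return { contentLocation: new URL(filePart, packageUrl).href, innerFragment };
    }

    // Using the example from the draft:
    // resolvePackageFragment("http://example.org/path/to/package.pack", "file=/home.html%23section1")
    //   → { contentLocation: "http://example.org/home.html", innerFragment: "section1" }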
>> > ### `+package` Structured Suffix
>> >
>> > The `+package` structured suffix should be used on other multipart media types that are used for more specialised packages. This is particularly useful for package formats that must contain manifest files in particular formats; these should use the `+package` structured suffix in their media type.
>> >
>> > For example, a `multipart/widget+package` media type could specify that a web application package must contain a [`config.xml` configuration document](http://www.w3.org/TR/widgets/#configuration-document) using the `http://www.w3.org/ns/widgets` XML vocabulary as the first file within the package, and that all other files in the package must be listed within this configuration file or ignored.
>> >
>> > Clients can treat any file with an unrecognised `+package` media type as if it were a `multipart/package` file.
>> >
>> > ## Requesting a Package
>> >
>> > Packages live on the web just like any other file. Thus it is perfectly possible to request a package directly. For example:
>> >
>> >     GET /path/to/package.pack HTTP/1.1
>> >     Accept: multipart/package,multipart/*,*/*
>> >
>> > should result in a response like:
>> >
>> >     HTTP/1.1 200 OK
>> >     Content-Type: multipart/package;boundary=package-boundary
>> >
>> >     ... package content ...
>> >
>> > This satisfies the second of the requirements described above, namely the easy distribution of self-contained content. Note that the `boundary` parameter is required for multipart media types as defined in RFC 2046.
>> >
>> > To locate a package of representations of related resources, to support the efficient delivery of scripts, stylesheets, images and so on over the web, we recommend the use of a new `package` link relation. This can be used within a `<link>` element in an HTML document:
>> >
>> >     <link rel="package" href="/path/to/package.pack">
>> >
>> > When the package is not HTML-based (for example if it is described by a metadata file in JSON or XML), the `package` link relation can be used within a `Link` header:
>> >
>> >     Link: </path/to/package.pack>; rel="package"
>> >
>> > ### Processing Packaged Content
>> >
>> > Clients that receive packaged content should unpackage it by splitting the package on the boundary indicated in the media type. If there was no `Content-Type` header or no `boundary` parameter on the given content type then clients may recover by inferring the boundary from the content of the package.
>> >
>> >> *Note: I guess that the `boundary` parameter is required because there were implementations at the time of standardisation that didn't start with `--boundary`. Now it's a real pain as it means multipart files can't be self-contained.*
>> >
>> > The `Content-Location`s associated with each packaged file must be resolved relative to the location of the package. If this results in a location that has a different origin from the package, the file must be ignored.
>> >
>> > For performance, it is good practice for the first file in the package to be the "root" of the package, referencing all the other files, but this cannot be relied upon by the client as the same package may be used for multiple resources.
>> >
>> > Other files may be cached for later use by the client, with headers set as appropriate based on those provided within the package and within the response to the original request.
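As a rough sketch of that processing model — split on the boundary, read each part's headers, resolve `Content-Location` against the package URL and ignore cross-origin entries — something like the following (TypeScript; the names and the simplified MIME handling are mine, and a production parser would need to cope with many more edge cases):

    interface PackagedFile {
      url: string;          // resolved Content-Location
      contentType: string;  // Content-Type of the part, if any
      body: string;         // part content (textual parts assumed, for simplicity)
    }

    function unpackage(packageUrl: string, contentTypeHeader: string, body: string): PackagedFile[] {
      const match = /boundary=([^;]+)/.exec(contentTypeHeader);
      if (!match) throw new Error("no boundary parameter; a client could try to infer it from the content");
      const boundary = match[1].trim().replace(/^"|"$/g, "");
      const packageOrigin = new URL(packageUrl).origin;
      const files: PackagedFile[] = [];

      for (const part of body.split("--" + boundary)) {
        const trimmed = part.replace(/^\r?\n/, "");
        if (trimmed === "" || trimmed.startsWith("--")) continue; // preamble or closing delimiter
        const separator = trimmed.includes("\r\n\r\n") ? "\r\n\r\n" : "\n\n";
        const [rawHeaders, ...rest] = trimmed.split(separator);
        const headers = new Map<string, string>();
        for (const line of rawHeaders.split(/\r?\n/)) {
          const colon = line.indexOf(":");
          if (colon > 0) headers.set(line.slice(0, colon).trim().toLowerCase(), line.slice(colon + 1).trim());
        }
        const location = headers.get("content-location");
        if (!location) continue; // multipart/package requires Content-Location on every part
        const resolved = new URL(location, packageUrl);
        if (resolved.origin !== packageOrigin) continue; // different origin: must be ignored
        files.push({
          url: resolved.href,
          contentType: headers.get("content-type") ?? "",
          body: rest.join(separator),
        });
      }
      return files;
    }

The cross-origin check is the security rule from the draft: any part whose resolved `Content-Location` is not same-origin with the package is dropped.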
>> > ## Security Implications
>> >
>> > When used with the `Package` header, the goal of a package is to populate a client's cache and prevent it from making additional unnecessary requests. Content from the cache will run with a base URL supplied within the package. If the `Content-Location`s given for the files in a package weren't restricted to the domain that the package is hosted on, `evil.com` could deliver content that it claimed came from `bank.com` and that would then be interpreted as if it did, indeed, come from `bank.com` but that ran scripts inserted by `evil.com`.
>> >
>> > For the same reason, publishers should be careful about the packages that they provide on their site, in just the same way as they should avoid hosting untrusted HTML.
>> >
>> > ## Rejected Approaches
>> >
>> > Other approaches to packaging have been used and were considered by the TAG.
>> >
>> > ### Zip as a Packaging Format
>> >
>> > The TAG discussed the use of zipped files as a packaging format. The main problem with using zips is that the *central directory record*, which lists the valid files within the zip archive, appears at the *end* of the zip. Implementations therefore need to wait until the whole zip is downloaded before the files within it can be read. This makes it unsuitable as a packaging format for efficient delivery of content on the web (the first of the requirements described above).
>> >
>> > A secondary problem with zip as a packaging format is that while there are mechanisms for supplying additional information about individual files within the package (through *extra fields*), they are not sufficient for extended metadata. Each extra field has a 2-byte ID code and a 2-byte length, followed by its data. The list of valid core and extended ID codes is provided within sections 4.5 and 4.6 of the [zip definition](http://www.pkware.com/documents/casestudies/APPNOTE.TXT). The file header within the zip, which includes these extra fields, must not exceed 64k in size.
>> >
>> > These limitations have resulted in people who use zip as a packaging format providing separate manifest files within the zip.
>> >
>> > ### Other Packaging Formats
>> >
>> > #### Mozilla Archive Format
>> >
>> > The [Mozilla Archive Format](http://maf.mozdev.org/maff-specification.html) is a zip-based packaging format for web content which uses an RDF/XML manifest file within the zip to provide additional information about the content. This approach has the drawbacks described in the previous section, particularly lack of streamability.
>> >
>> > #### MHTML
>> >
>> > [RFC 2557](http://tools.ietf.org/html/rfc2557) defines MIME Encapsulation of Aggregate Documents, such as HTML (MHTML). This uses the `multipart/related` media type, with the first file in the package being the packaged HTML document and the remainder being related resources.
>> >
>> > This is not a suitable general format for publishing packages on the web as it is designed around an HTML page being the primary starting point for a package, which is not true in all circumstances.
>> >
>> > #### Webarchive
>> >
>> > The [Webarchive](http://en.wikipedia.org/wiki/Webarchive) format uses `application/x-webarchive` as a media type. It is a proprietary format defined by Apple and used within Safari. There is very little information available about its internal structure.
>> >
>> > #### WARC
>> >
>> > The [WARC](http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml) format is used for archiving web content. Although it provides for packaging, and metadata for the files within the package, it is designed for archiving and is fairly heavyweight for the packages that are under discussion here, requiring `WARC-Record-ID`, `Content-Length`, `WARC-Date` and `WARC-Type` headers.
>> >
>> > ### Package Requests
>> >
>> > An approach to requesting packages that we considered would be to include a new `Package: true` header in HTTP requests for normal files on a web server. Servers that understand the `Package` header could then respond with a new `2XX Packaged Content` success response whose body is a package that includes a representation of the requested resource, along with representations of any other related resources.
>> >
>> > For example, a client that understood packaging would send a request like:
>> >
>> >     GET /home.html HTTP/1.1
>> >     Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
>> >     Package: true
>> >
>> > The `Package` header with the value `true` would indicate that the server should attempt to respond with a package that includes the requested resource. If the server is a legacy server that does not understand the `Package` header, or if the server understands the `Package` header but does not have a suitable package with which it can respond, it will respond as normal to this request:
>> >
>> >     HTTP/1.1 200 OK
>> >     Content-Type: text/html
>> >
>> >     ... content of /home.html ...
>> >
>> > If the server understands the `Package` header and can respond with a package that contains the requested representation (`/home.html`) then it should respond with a `2XX Packaged Content` response.
>> > The `Content-Location` header in this response would indicate the location of the package:
>> >
>> >     HTTP/1.1 2XX Packaged Content
>> >     Content-Type: multipart/package;boundary=package-boundary
>> >     Content-Location: /path/to/package.pack
>> >
>> >     ... content of /path/to/package.pack ...
>> >
>> > A `2XX Packaged Content` response would indicate that the server is responding with a package that includes the same representation for the requested resource as would have been provided, with a `200 OK` response, if the `Package` header had not been present in the request.
>> >
>> > The problem with this approach is that it requires some fairly large changes to HTTP: a new HTTP header and a new HTTP status code. These are complicated to implement both in terms of specification and in terms of getting servers and clients to support them. New status codes in particular are difficult to plug in to popular web servers such as Apache. Using a non-standard status code also requires configuration access to servers, which isn't possible in many publishing environments.
>> >
>> > ### Specialising URLs
>> >
>> > The TAG [investigated the use of a special URL syntax](https://gist.github.com/wycats/220039304b053b3eedd0) that would enable package-aware clients to work with packages whilst legacy clients and servers work with individual files. This approach is designed to meet the requirement that someone could use it on a file-system-based web server without access to any configuration options. In other words, it does not require servers to be package aware.
>> >
>> >> **Note: This requirement also entails using a self-contained packaging format. The multipart format described above is not self-contained because it requires the `boundary` parameter to be set via a `Content-Type` header. Zip packages or multipart packages nested inside `message/http` documents are alternative self-contained packaging formats.**
>> >
>> > For example, we explored using:
>> >
>> >     http://example.com/path/to/package.pack!/home.html#section1
>> >
>> > to indicate the anchor `section1` within the file `home.html` within the package `/path/to/package.pack`. The separator `!/` is a proposed unique separator between the package location and the location of the target file within the package.
>> >
>> > If someone wanted to provide packages for their files, they would structure their URL space so that it looked like:
>> >
>> >     path/
>> >       to/
>> >         package.pack
>> >         package.pack!/
>> >           home.html
>> >
>> > A package-aware client would recognise that the URL `http://example.com/path/to/package.pack!/home.html#section1` contained the package separator `!/`. Instead of directly requesting the file `http://example.com/path/to/package.pack!/home.html` as a legacy client would, it would request the file `http://example.com/path/to/package.pack`, unpack the package, use the contents of the package to populate its cache, and then navigate to `http://example.com/path/to/package.pack!/home.html`, which would then be within the cached content.
>> >
>> > The separator `!/` is designed such that it is unlikely to appear in existing URLs [TODO: some analysis on whether this is actually the case]. It is also designed to enable relative links to work.
>> > If there is a link within `home.html` to `faq.html` in the same package, you would want to write within the page simply:
>> >
>> >     <a href="faq.html">FAQ</a>
>> >
>> > With a base URL of `http://example.com/path/to/package.pack!/home.html#section1` such a link would resolve to `http://example.com/path/to/package.pack!/faq.html`. Similarly, links that started with `.` or `..` would continue to resolve as expected; the package works exactly as a directory.
>> >
>> > This approach could be effectively polyfilled using [Service Workers](https://github.com/slightlyoff/ServiceWorker/blob/master/explainer.md). The Service Worker would intercept two types of requests:
>> >
>> > 1. requests that include `!/` would be mapped into requests for the package; the resulting package would be used to populate a content cache containing the unpacked package
>> > 2. further requests for pages that are controlled by the Service Worker would be fulfilled from the populated content cache where packaged content has been provided
>> >
>> > Implementation through Service Worker enables sites to use this packaging method without any cross-site standardisation effort.
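A rough shape of such a polyfill, for illustration only — the cache name and the `parsePackage` helper are invented (the helper is assumed to turn a `multipart/package` response into entries keyed by their `package.pack!/...` URLs), and the Service Worker explainer linked above is the authority on the API itself:

    declare function parsePackage(
      packageUrl: string,
      response: Response
    ): Promise<{ url: string; headers: Record<string, string>; body: string }[]>;

    const PACKAGE_CACHE = "package-cache-v1"; // invented cache name

    self.addEventListener("fetch", (event: any) => {
      const url: string = event.request.url;
      const separator = url.indexOf("!/");

      if (separator !== -1) {
        // 1. A request for content inside a package: fetch the package itself,
        //    unpack it into the cache, then answer from the cache.
        const packageUrl = url.slice(0, separator);
        event.respondWith(
          fetch(packageUrl)
            .then(async (response) => {
              const cache = await caches.open(PACKAGE_CACHE);
              for (const entry of await parsePackage(packageUrl, response)) {
                await cache.put(entry.url, new Response(entry.body, { headers: entry.headers }));
              }
              return cache.match(event.request);
            })
            .then((cached) => cached ?? fetch(event.request))
        );
      } else {
        // 2. Any other request controlled by this Service Worker: prefer
        //    previously unpacked content, falling back to the network.
        event.respondWith(
          caches.match(event.request).then((cached) => cached ?? fetch(event.request))
        );
      }
    });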
>> > The biggest architectural problem with standardising this approach is that it places additional constraints on URL spaces, at least for items for which a package should be downloaded. As detailed in the [Internet Draft Standardising Structure in URIs](http://tools.ietf.org/html/draft-ietf-appsawg-uri-get-off-my-lawn-00), there are risks when defining new standard internal structures within URLs:
>> >
>> > * **collisions**: the suggested convention of `!/` may clash with URL conventions used on other systems that have different best practices for URL structures
>> > * **dilution**: the arrangement of files into packages is ephemeral information and does not reflect the semantic content of the files; it is bad practice to include ephemeral information in URLs as it makes those URLs likely to change, and therefore links to break
>> > * **brittleness**: baking a particular new URL structure into the web is a far-reaching change that will be hard to change in the future
>> > * **operational difficulty**: creating URLs containing `!/` may be difficult in some systems, for example where it is hard to create directories that contain the `!` character
>> > * **client assumptions**: there may be existing URLs that contain the package delimiter (eg `!/`) that would break with new package-aware clients
>> >
>> > The issues of dilution and operational difficulty are particularly apparent when considering a file that should appear in multiple packages. The person managing the server would have to ensure it's duplicated whenever it's updated; those referencing the file would have to choose which instance of the file to reference depending on which other files should be packaged with it.
>> >
>> > ### Content Negotiation
>> >
>> > The TAG explored the use of content negotiation to retrieve a package of resources. In this scenario, a client that understood packages would include `multipart/package` as the most-favoured type of response:
>> >
>> >     GET /home.html HTTP/1.1
>> >     Accept: multipart/package,text/html;q=0.95,application/xhtml+xml;q=0.95,application/xml;q=0.9,image/webp,*/*;q=0.8
>> >
>> > A server that had a package containing `/home.html` would respond with that package:
>> >
>> >     HTTP/1.1 200 OK
>> >     Content-Type: multipart/package;boundary=boundary-in-home-package.pack
>> >     Content-Location: /home-package.pack
>> >
>> > There are three potential problems with this approach.
>> >
>> > First, a package that contains `/home.html` is arguably not a representation of the resource `/home.html`, only a container for such a representation.
>> >
>> >> *Note: It's not clear whether the fact that there's a mismatch in semantics actually has any implementation impact.*
>> >
>> > Second, the server would still need to use the rest of the `Accept` header to determine what to include within the package, or indeed whether a package can be created at all for the resource. For example, if the request had an `Accept` header of:
>> >
>> >     Accept: multipart/package,text/json;q=0.9
>> >
>> > then we would like the server to respond with a package that contained the `text/json` representation of the requested resource, or to give a `406 Not Acceptable` response if there was no such package. This ability to dig into the remaining part of the `Accept` header to determine a response would require revising the way in which the `Accept` header works, which we can't do.
>> >
>> > Third, there would be no mechanism to differentiate between requesting a package directly and requesting a package that contains a packaged resource. For example, say that CSV and metadata were packaged together into `multipart/package` files like `http://example.com/data.pack`. It would not be clear from a request like:
>> >
>> >     GET /data.pack HTTP/1.1
>> >     Accept: multipart/package
>> >
>> > whether the request was directly for `/data.pack` or for a package that contained `/data.pack`.
>> >
>> > --
>> > Jeni Tennison
>> > http://www.jenitennison.com/
>>
>> --
>> Herbert Van de Sompel
>> Digital Library Research & Prototyping
>> Los Alamos National Laboratory, Research Library
>> http://public.lanl.gov/herbertv/
>>
>> ==
Received on Sunday, 2 February 2014 02:27:58 UTC