Re: Packaging on the Web

On Tue, Jan 28, 2014 at 4:50 AM, Herbert van de Sompel <hvdsomp@gmail.com>wrote:

> Dear Jeni,
>
> I read this with quite some interest in relation to the OAI-ORE
> specification [1], an RDF-based approach for identifying and
> describing aggregations of web resources. In the Research Object work
> [2,3], the need was expressed to be able to package the description of
> an aggregation of web resources along with representations of the
> aggregated resources. The approach you describe looks rather
> attractive with this regard. The "root" document in case of OAI-ORE
> would be an RDF serialization.
>
> The following questions come up:
>
> (a) With different serializations of the root RDF document being
> available, how does a client request a preferred one? From the "
> Accept: multipart/package,text/json;q=0.9" example you provide, it
> seems the proposed approach does not provide that functionality.
> - Is my reading correct?
> - Could different representations of the root RDF document be included
> in the same package?
>

These packages are about  bits, "files" if you will, with (potentially
differing) URLs and mime types. I think you've mistaken them for something
higher level.


> (b) More generally, how does the proposed approach work with the 303
> style approaches of Cool URIs for the Semantic Web [4], i.e. how to
> express interest in a package with an RDF root document, an HTML root
> document?
>
> (c) The constraint to limit `Content-Location`s for the files in a
> package to the domain that the package is hosted on is understandable.
> Still, for example in the Research Object case, there can be a need to
> include a specific representation of a "remote" aggregated resource,
> e.g. a specific version of an evolving resource that was used to
> obtain a research result. In this case, does the constraint imply that
> the author of the aggregation needs to:
> - Collect that specific representation (this would have to be done
> whichever way)
> - Host that specific representation at a URI under its control
> - Include that specific representation in the package with that URI as
> Content-Location
> - Provide a link to the remote URI in the header fields for that
> specific representation with an appropriate relation type, e.g.
> Memento's "original"
>
> Greetings
>
> Herbert
>
> [1] http://www.openarchives.org/ore/1.0/datamodel
> [2] http://www.researchobject.org/
> [3] http://www.w3.org/community/rosc/
> [4] http://www.w3.org/TR/cooluris/
>
> On Tue, Jan 21, 2014 at 5:09 AM, Jeni Tennison <jeni@jenitennison.com>
> wrote:
> > Hi,
> >
> > I took on the task of documenting the outcomes of our discussions around
> packaging at the last F2F [1] and the one before that [2].
> >
> > I have written a very draft something at
> >
> >   https://github.com/w3ctag/packaging-on-the-web
> >
> > which suggests a way forward that doesn't involve changing the way in
> which URLs are parsed. Marcos has already added two issues against it
> (thanks Marcos!). Other thoughts also welcome!
> >
> > I have copied below for those people who don't like following links from
> mails and to aid discussion here.
> >
> > Cheers,
> >
> > Jeni
> >
> > [1] http://www.w3.org/2001/tag/2014/01/08-minutes.html#item07
> > [2] http://www.w3.org/2001/tag/2013/10/01-minutes.html#item06
> >
> > ---
> > # Packaging on the Web
> >
> > This document describes an approach for creating packages of files for
> use on the web. The approach is to package them using a new
> `multipart/package` media type and a `+package` structured syntax. To
> access packages related to other files on the web, clients that understand
> packages of files look for a `Link` header or (in HTML documents) a
> `<link>` element with a new link relation of `package`. Other formats may
> define format-specific mechanisms for locating related packages.
> >
> > **This is an unreviewed draft by Jeni Tennison and does not represent
> official TAG or W3C opinion.**
> >
> > ## Requirements
> >
> > There are two main requirements for packages on the web:
> >
> >   * efficient delivery of related content on the web
> >   * easy distribution of self-contained content
> >
> > There is also a final cross-cutting requirement: that the solution can
> be easily and backwards-compatibly deployed on the web.
> >
> > ### Efficient Delivery
> >
> > If a user visits `http://www.bbc.co.uk/` they will need to download
> about 160 files to view the page in its entirety. The HTML page they
> download at `http://www.bbc.co.uk/` contains references to stylesheets,
> scripts, images and other files, each of which may contain references to
> further files themselves.
> >
> > Delivering a package of these files could be more efficient than
> delivering individual files. Downloading each file has a connection
> overhead which is particularly impactful on low-bandwidth mobile devices
> and on secure connections.
> >
> > The browser can't work out which additional files to download until it
> receives the HTML page, but the server could plausibly deliver a package
> that contains all the required files through a single connection. This
> would have to work in a backwards compatible way for both older browsers
> interacting with package-aware servers, and for package-aware clients
> working with older servers.
> >
> >> *Note: Efficient delivery is the aim of [pipelining in HTTP 1.1](
> http://www.w3.org/Protocols/rfc2616/rfc2616-sec8.html) (and [HTTPbis](
> http://tools.ietf.org/html/draft-ietf-httpbis-p1-messaging-25#section-6.3.2))
> and [multiplexing in HTTP 2.0](
> http://tools.ietf.org/html/draft-ietf-httpbis-http2-09#section-2.2).
> These enable multiple requests to be passed, and responded to, over the
> same persistent connection. But they do not enable the server to deliver
> content for predicted requests.*
> >
> >> *Note: The `rel=prefetch` link relation prompts the browser to access
> additional HTML pages that have not yet been requested by the user, but
> again this is a different facility; there is no such think as prefetching
> stylesheets, scripts or images as these are already required for the page
> by the time the browser knows about them.*
> >
> >> *Note: given that browsers can typically have more than one connection
> open to a website, and download files in parallel, is there an argument for
> supporting having multiple packages associated with a given page?*
> >
> >
> > ### Self-Contained Content
> >
> > It is sometimes useful to distribute self-contained packages of
> material. Examples are [Packaged Web Apps](http://www.w3.org/TR/widgets/)
> or packages of CSV files used within the [Simple Data Format](
> http://dataprotocols.org/simple-data-format/) or the [DataSet Publishing
> Language](https://developers.google.com/public-data/). These packages
> typically contain a manifest file, in a machine readable format (text, JSON
> or XML), that contains further details and metadata about the files that
> the package contains.
> >
> > ## Packaging Format
> >
> >> *Note: For rationale for selecting this above other packaging formats,
> see 'Rejected Approaches' below.*
> >
> > The TAG recommends using [multipart media types](
> http://tools.ietf.org/html/rfc2046#section-5.1) for packaging materials
> on the web. Specifically we recommend the registration of a new
> `multipart/package` media type and the registration of a new `+package`
> structured syntax suffix, per [RFC 6838](
> http://tools.ietf.org/html/rfc6838#section-6).
> >
> > Multipart media types are defined in [RFC 2046](
> http://tools.ietf.org/html/rfc2046#section-5.1). They can contain one or
> more *body parts*, each comprising:
> >
> >   * a *boundary*
> >   * a number of *header fields*
> >   * an empty line
> >   * the *content* of the file
> >
> >> *Ed note: From what I can tell it's possible for the header fields
> within a multipart response to be anything so long as it follows the normal
> `Header: Value` syntax used in MIME and HTTP. So this could be a very
> flexible mechanism for adding metadata.*
> >
> > The `multipart/mixed` media type places no constraints on which header
> fields are specified within the multipart file (see the definition of
> `MIME-part-headers` in [RFC 2045](
> http://tools.ietf.org/html/rfc2045#section-3)).
> >
> > The only difference between `multipart/mixed` and `multipart/package` is
> that the header fields in every body part includes the `Content-Location`
> header. Further, the values of the `Content-Location` header must all be
> [absolute-path-relative URLs](
> http://url.spec.whatwg.org/#concept-absolute-path-relative-url) or
> [path-relative URLs](http://url.spec.whatwg.org/#concept-path-relative-url).
> (See Security Considerations later.)
> >
> > ### Fragment Identifiers for Packages
> >
> > A basic fragment identifier scheme for the `multipart/package` media
> type is of the form `file=url` where *url* is resolved against the base URL
> of the package, the part before the fragment in *url* is used to identify a
> file within the package using the `Content-Location` part header, and any
> fragment in *url* is used to identify a fragment within that file according
> to the media type for that file (as given in the `Content-Type` header).
> >
> > For example, the URL
> >
> >     http://example.org/path/to/package.pack#file=/home.html%23section1
> >
> > refers to the file within `http://example.org/path/to/package.pack`whose `Content-Location` is
> >
> >     http://example.org/home.html
> >
> > and more specifically the element with the `id` `section1` within that
> file. This should be the same as `http://example.org/home.html#section1`.
> >
> > In general, links should be made directly to files on the web rather
> than to files within packages. The particular package(s) that a file
> appears in is an ephemeral phenomenon and not suitable for inclusion in a
> URL.
> >
> > ### `+package` Structured Suffix
> >
> > The `+package` structured suffix should be used on other multipart media
> types that are used for more specialised packages. This is particularly
> useful for package formats that must contain manifest files in particular
> formats; these should use the `+package` structured suffix in their media
> type.
> >
> > For example, a `multipart/widget+package` media type could specify that
> a web application package must contain an [`config.xml` configuration
> document](http://www.w3.org/TR/widgets/#configuration-document) using the
> `http://www.w3.org/ns/widgets` XML vocabulary as the first file within
> the package, and that all other files in the package must be listed within
> this configuration file or ignored.
> >
> > Clients can treat any file with an unrecognised `+package` media type as
> if it were a `multipart/package` file.
> >
> > ## Requesting a Package
> >
> > Packages live on the web just like any other file. Thus it is perfectly
> possible to request a package directly. For example:
> >
> >     GET /path/to/package.pack HTTP/1.1
> >     Accept: multipart/package,multipart/*,*/*
> >
> > should result in a response like:
> >
> >     HTTP/1.1 200 OK
> >     Content-Type: multipart/package;boundary=package-boundary
> >
> >     ... package content ...
> >
> > This satisfies the second of the requirements described above, namely
> the easy distribution of self-contained content. Note that the `boundary`
> parameter is required for multipart media types as defined in RFC 2046.
> >
> > To locate a package of representations of related resources, to support
> the efficient delivery of scripts, stylesheets, images and so on over the
> web, we recommend the use of a new `package` link relation. This can be
> used within a `<link>` header in an HTML document:
> >
> >     <link rel="package" href="/path/to/package.pack">
> >
> > When the package is not HTML-based (for example if it is defined through
> a metadata file defined in JSON or XML), the `package` link relation can be
> used within a `Link` header:
> >
> >     Link: </path/to/package.pack>; rel="package"
> >
> > ### Processing Packaged Content
> >
> > Clients that receive packaged content should unpackage by splitting the
> package on the boundary indicated in the media type. If there was no
> `Content-Type` header or no `boundary` parameter on the given content type
> then clients may recover by inferring the boundary from the content of the
> packaged content.
> >
> >> *Note: I guess that the `boundary` parameter is required because there
> were implementations at the time of standardisation that didn't start with
> `--boundary`. Now it's a real pain as it means multipart files can't be
> self-contained.*
> >
> > The `Content-Location`s associated with each packaged file must be
> resolved relative to the location of the package. If this results in a
> location that has a different origin from the package, the file must be
> ignored.
> >
> > For performance, it is good practice for the first file in the package
> to be the "root" of the package, referencing all the other files, but this
> cannot be relied upon by the client as the same package may be used for
> multiple resources.
> >
> > Other files may be cached for later use by the client, with headers set
> as appropriate based  on those provided within the package and within the
> response to the original request.
> >
> > ## Security Implications
> >
> > When used with the `Package` header, the goal of a package is to
> populate a client's cache and prevent it from making additional unnecessary
> requests. Content from the cache will run with a base URL supplied within
> the package. If the `Content-Location`s given for the files in a package
> weren't restricted to the domain that the package is hosted on, `evil.com`
> could deliver content that it claimed came from `bank.com` and that would
> then be interpreted as if it did, indeed, come from `bank.com` but that
> ran scripts inserted by `evil.com`.
> >
> > For the same reason, publishers should be careful about the packages
> that they provide on their site, in just the same way as they should avoid
> hosting untrusted HTML.
> >
> > ## Rejected Approaches
> >
> > Other approaches to packaging have been used and were considered by the
> TAG.
> >
> > ### Zip as a Packaging Format
> >
> > The TAG discussed the use of zipped files as a packaging format. The
> main problem with using zips is that the *central directory record*, which
> lists the valid files within the zip archive, appears at the *end* of the
> zip. Implementations therefore need to wait until the whole zip is
> downloaded before the files within it can be read. This makes it unsuitable
> for a packaging format for efficient delivery of content on the web (the
> first of the requirements described above).
> >
> > A secondary problem with zip as a packaging format is that while there
> are mechanisms for supplying additional information about individual files
> within the package (through *extra fields*), they are not sufficient for
> extended metadata. Each extra field is a 2-byte ID code with a 2-byte
> value. The list of valid core and extended ID codes are provided within
> section 4.5 and 4.6 of the [zip definition](
> http://www.pkware.com/documents/casestudies/APPNOTE.TXT). The file header
> within the zip, which includes these extra fields, must not exceed 64k in
> size.
> >
> > These limitations have resulted in people who use zip as a packaging
> format providing separate manifest files within the zip.
> >
> > ### Other Packaging Formats
> >
> > #### Mozilla Archive Format
> >
> > The [Mozilla Archive Format](
> http://maf.mozdev.org/maff-specification.html) is a zip-based packaging
> format for web content which uses an RDF/XML manifest file within the zip
> to provide additional information about the content. This approach has the
> drawbacks described in the previous section, particularly lack of
> streamability.
> >
> > #### MHTML
> >
> > [RFC 2557](http://tools.ietf.org/html/rfc2557) defines MIME
> Encapsulation of Aggregate Documents, such as HTML (MHTML). This uses the
> `multipart/related` media type, with the first file in the package being
> the packaged HTML document and the remainder being related resources.
> >
> > This is not a suitable general format for publishing packages on the web
> as it is designed around an HTML page being the primary starting point for
> a package, which is not true in all circumstances.
> >
> > #### Webarchive
> >
> > The [Webarchive](http://en.wikipedia.org/wiki/Webarchive) format uses
> `application/x-webarchive` as a media type. It is a proprietary format
> defined by Apple and used within Safari. There is very little information
> available about its internal structure.
> >
> > #### WARC
> >
> > The [WARC](
> http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml) format is
> used for archiving web content. Although it provides for packaging, and
> metadata for the files within the package, it is designed for archiving and
> is fairly heavyweight for the packages that are under discussion here,
> requiring `WARC-Record-ID`, `Content-Length`, `WARC-Date` and `WARC-Type`
> headers.
> >
> > ### Package Requests
> >
> > An approach to requesting packages that we considered would be to
> include a new `Package: true` header in HTTP requests for normal files on a
> web server. Servers that understand the `Package` header could then respond
> with a new `2XX Packaged Content` success response whose body is a package
> that includes a representation of the requested resource, along with
> representations of any other related resources.
> >
> > For example, a client that understood packaging would send a request
> like:
> >
> >     GET /home.html HTTP/1.1
> >     Accept:
> text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
> >     Package: true
> >
> > The `Package` header with the value `true` would indicate that the
> server should attempt to respond with a package that includes the requested
> resource. If the server is a legacy server that does not understand the
> `Package` header, or if the server understands the `Package` header but
> does not have a suitable package with which it can respond, it will respond
> as normal to this request:
> >
> >     HTTP/1.1 200 OK
> >     Content-Type: text/html
> >
> >     ... content of /home.html ...
> >
> > If the server understands the `Package` header and can respond with a
> package that contains the requested representation (`/home.html`) then it
> should respond with a `2XX Packaged Content` response. The
> `Content-Location` header in this response would indicate the location of
> the package:
> >
> >     HTTP/1.1 2XX Packaged Content
> >     Content-Type: multipart/package;boundary=package-boundary
> >     Content-Location: /path/to/package.pack
> >
> >     ... content of /path/to/package.pack ...
> >
> > A `2XX Packaged Content` response would indicate that the server is
> responding with a package that includes the same representation for the
> requested resource as would have been provided, with a `200 OK` response,
> if the `Package` header had not been present in the request.
> >
> > The problem with this approach is that it requires some fairly large
> changes to HTTP: a new HTTP header and a new HTTP status code. These are
> complicated to implement both in terms of specification and in terms of
> getting servers and clients to support them. New status codes in particular
> are difficult to plug in to popular web servers such as Apache. Using a
> non-standard status code also requires configuration access to servers,
> which isn't possible in many publishing environments.
> >
> > ### Specialising URLs
> >
> > The TAG [investigated the use of a special URL syntax](
> https://gist.github.com/wycats/220039304b053b3eedd0) that would enable
> package-aware clients to work with packages whilst legacy clients and
> servers work with individual files. This approach is designed to meet the
> requirement that someone could use it on a file-system-based web server
> without access to any configuration options. In other words, it does not
> require servers to be package aware.
> >
> >> **Note: This requirement also entails using a self-contained packaging
> format. The multipart format described above is not self-contained because
> it requires the `boundary` parameter to be set via a `Content-Type` header.
> Zip packages or multipart packages nested inside `message/http` documents
> are alternative self-contained packaging formats.**
> >
> > For example, we explored using:
> >
> >     http://example.com/path/to/package.pack!/home.html#section1
> >
> > to indicate the anchor `section1` within the file `home.html` within the
> package `/path/to/package.pack`. The separator `!/` is a proposed unique
> separator between the package location and the location of the target file
> within the package.
> >
> > If someone wanted to provide packages for their files, they would
> structure their URL space so that it looked like:
> >
> >     path/
> >       to/
> >         package.pack
> >         package.pack!/
> >           home.html
> >
> > A package-aware client would recognise that the URL `
> http://example.com/path/to/package.pack!/home.html#section1` contained
> the package separator `!/`. Instead of directly requesting the file `
> http://example.com/path/to/package.pack!/home.html` as a legacy client
> would, it would request the file `http://example.com/path/to/package.pack`,
> unpack the package, use the contents of the package to populate its cache,
> and then navigate to `http://example.com/path/to/package.pack!/home.html`,
> which would then be within the cached content.
> >
> > The separator `!/` is designed such that it is unlikely to appear in
> existing URLs [TODO: some analysis on whether this is actually the case].
> It is also designed to enable relative links to work. If there is a link
> within `home.html` to `faq.html` in the same package, you would want to
> write within the page simply:
> >
> >     <a href="faq.html">FAQ</a>
> >
> > With a base URL of `
> http://example.com/path/to/package.pack!/home.html#section1` such a link
> would resolve to `http://example.com/path/to/package.pack!/faq.html`.
> Similarly, links that started with `.` or `..` would continue to resolve as
> expected; the package works exactly as a directory.
> >
> > This approach could be effectively polyfilled using [Service Workers](
> https://github.com/slightlyoff/ServiceWorker/blob/master/explainer.md).
> The Service Worker would intercept two types of requests:
> >
> >   1. requests that include `!/` would be mapped into requests for the
> package; the resulting package would be used to populate a content cache
> containing the unpacked package
> >   2. further requests for pages that are controlled by the Service
> Worker would be fulfilled from the populated content cache where packaged
> content has been provided
> >
> > Implementation through Service Worker enables sites to use this
> packaging method without any cross-site standardisation effort.
> >
> > The biggest architectural problem with standardising this approach is
> that it places additional constraints on URL spaces, at least for items for
> which a package should be downloaded. As detailed in the [Internet Draft
> Standardising Structure in URIs](
> http://tools.ietf.org/html/draft-ietf-appsawg-uri-get-off-my-lawn-00),
> there are risks when defining new standard internal structures within URLs:
> >
> >   * **collisions**: the suggested convention of `!/` may clash with URL
> conventions used on other systems that have different best practices for
> URL structures
> >   * **dilution**: the arrangement of files into packages is ephemeral
> information and does not reflect the semantic content of the files; it is
> bad practice to include ephemeral information in URLs as it makes those
> URLs likely to change, and therefore links to break
> >   * **brittleness**: baking in a particular new URL structure into the
> web is a far reaching change that will be hard to change in the future
> >   * **operational difficulty**: creating URLs containing `!/` may be
> difficult in some systems, for example where it is hard to create
> directories that contain the `!` character
> >   * **client assumptions**: there may be existing URLs that contain the
> package delimiter (eg `!/`) that would break with new package-aware clients
> >
> > The issues of dilution and operational difficulty are particularly
> apparent when considering a file that should appear in multiple packages.
> The person managing the server would have to ensure it's duplicated
> whenever it's updated; those referencing the file would have to choose
> which instance of the file to reference depending on which other files
> should be packaged with it.
> >
> > ### Content Negotiation
> >
> > The TAG explored the use of content negotiation to retrieve a package of
> resources. In this scenario, a client that understood packages would
> include `multipart/package` as the most-favoured type of response:
> >
> >     GET /home.html HTTP/1.1
> >     Accept:
> multipart/package,text/html;q=0.95,application/xhtml+xml;q=0.95,application/xml;q=0.9,image/webp,*/*;q=0.8
> >
> > A server that had a package containing `/home.html` would respond with
> that package:
> >
> >     HTTP/1.1 200 OK
> >     Content-Type:
> multipart/package;boundary=boundary-in-home-package.pack
> >     Content-Location: /home-package.pack
> >
> > There are three potential problems with this approach.
> >
> > First, a package that contains `/home.html` is arguably not a
> representation of the resource `/home.html`, only a container for such a
> representation.
> >
> >> *Note: It's not clear whether the fact that there's a mismatch in
> semantics actually has any implementation impact.*
> >
> > Second, the server would still need to use the rest of the `Accept`
> header to determine what to include within the package, or indeed whether a
> package can be created at all for the resource. For example, if the request
> had an `Accept` header of:
> >
> >     Accept: multipart/package,text/json;q=0.9
> >
> > then we would like the server to respond with a package that contained
> the `text/json` representation of the requested resource, or to give a `406
> Not Acceptable` response if there was no such package. This ability to dig
> into the remaining part of the `Accept` header to determine a response
> would require revising the way in which the `Accept` header works, which we
> can't do.
> >
> > Third, there would be no mechanism to differentiate between requesting a
> package directly and requesting a package that contains a packaged
> resource. For example, say that CSV and metadata were packaged together
> into `multipart/package` files like `http://example.com/data.pack`. It
> would not be clear from a request like:
> >
> >     GET /data.pack HTTP/1.1
> >     Accept: multipart/package
> >
> > whether the request was directly for `/data.pack` or for a package that
> contained `/data.pack`.
> > --
> > Jeni Tennison
> > http://www.jenitennison.com/
> >
>
>
>
> --
> Herbert Van de Sompel
> Digital Library Research & Prototyping
> Los Alamos National Laboratory, Research Library
> http://public.lanl.gov/herbertv/
>
> ==
>
>

Received on Friday, 31 January 2014 21:51:21 UTC