Re: Packaging on the Web

On 21/01/2014 23:35, Larry Masinter wrote:
> I think you're wrong about multipart/related, in https://github.com/w3ctag/packaging-on-the-web where you say:
>
> "This is not a suitable general format for publishing packages on the web as it is designed around an HTML page being the primary starting point for a package, which is not true in all circumstances."
>
> There is nothing in the definition of multipart/related that restricts the "root" page to be HTML.
>

I was about to say the same thing!  The spec http://tools.ietf.org/html/rfc2387 
is quite clear about the options (type=, start=, etc.)

The cited spec (MHTML, http://tools.ietf.org/html/rfc2557) is specifically about 
using multipart/related for compound web documents.  (It doesn't help that the 
reference in the MHTML document to the multipart/related document is incorrect.)

For web use, I think a drawback of multipart/related is its use of content-id 
values (e.g. in the start parameter) rather than arbitrary URI references.  In 
my work on "research objects" (http://www.researchobject.org) I've found 
arbitrary relative URI references to be a very useful way of referencing 
components in packaged content.

...

Also, FYI, there's a W3C community group (a bit quiet at the moment, but I 
believe there are plans to re-invigorate it) for packaged scientific content - 
ROSC (http://www.w3.org/community/rosc/).  If there's a broader interest in 
packaged content, I'd strongly urge this group to participate in that.

#g
--


> Larry
> --
> http://larry.masinter.net
>
> <rant>
> It was certainly part of web architecture then that text/html was one of many document markup formats, which only had the privilege of being "lingua franca" -- a common language that everyone understood. Content negotiation and other ways of sender discovering receiver capabilities would allow other (perhaps better) markup formats.
>
> Web architecture has now devolved into "everything is in HTML or an extension to it", and capability negotiation devolved into using user-agent mapped to feature/misfeature implementation capabilities, which seems much more fragile.
> </rant>
>
>
>> -----Original Message-----
>> From: algermissen1971 [mailto:algermissen1971@me.com]
>> Sent: Tuesday, January 21, 2014 8:45 AM
>> To: Jeni Tennison
>> Cc: www-tag@w3.org
>> Subject: Re: Packaging on the Web
>>
>> Jeni,
>>
>> On 21.01.2014, at 13:09, Jeni Tennison <jeni@jenitennison.com> wrote:
>>
>>> Hi,
>>>
>>> I took on the task of documenting the outcomes of our discussions around
>> packaging at the last F2F [1] and the one before that [2].
>>>
>>> I have written a very draft something at
>>>
>>>    https://github.com/w3ctag/packaging-on-the-web
>>
>> Unbelievable - I just did (almost) the same thing last week but did not write it
>> up yet.
>>
>> Great!
>>
>> Jan
>>
>>>
>>> which suggests a way forward that doesn't involve changing the way in which
>> URLs are parsed. Marcos has already added two issues against it (thanks
>> Marcos!). Other thoughts also welcome!
>>>
>>> I have copied below for those people who don't like following links from
>> mails and to aid discussion here.
>>>
>>> Cheers,
>>>
>>> Jeni
>>>
>>> [1] http://www.w3.org/2001/tag/2014/01/08-minutes.html#item07
>>> [2] http://www.w3.org/2001/tag/2013/10/01-minutes.html#item06
>>>
>>> ---
>>> # Packaging on the Web
>>>
>>> This document describes an approach for creating packages of files for use on
>> the web. The approach is to package them using a new `multipart/package`
>> media type and a `+package` structured syntax. To access packages related to
>> other files on the web, clients that understand packages of files look for a `Link`
>> header or (in HTML documents) a `<link>` element with a new link relation of
>> `package`. Other formats may define format-specific mechanisms for locating
>> related packages.
>>>
>>> **This is an unreviewed draft by Jeni Tennison and does not represent
>> official TAG or W3C opinion.**
>>>
>>> ## Requirements
>>>
>>> There are two main requirements for packages on the web:
>>>
>>>    * efficient delivery of related content on the web
>>>    * easy distribution of self-contained content
>>>
>>> There is also a final cross-cutting requirement: that the solution can be easily
>> and backwards-compatibly deployed on the web.
>>>
>>> ### Efficient Delivery
>>>
>>> If a user visits `http://www.bbc.co.uk/` they will need to download about 160
>> files to view the page in its entirety. The HTML page they download at
>> `http://www.bbc.co.uk/` contains references to stylesheets, scripts, images
>> and other files, each of which may contain references to further files
>> themselves.
>>>
>>> Delivering a package of these files could be more efficient than delivering
>> individual files. Downloading each file has a connection overhead which is
>> particularly impactful on low-bandwidth mobile devices and on secure
>> connections.
>>>
>>> The browser can't work out which additional files to download until it receives
>> the HTML page, but the server could plausibly deliver a package that contains
>> all the required files through a single connection. This would have to work in a
>> backwards-compatible way, both for older browsers interacting with package-
>> aware servers and for package-aware clients working with older servers.
>>>
>>>> *Note: Efficient delivery is the aim of [pipelining in HTTP
>> 1.1](http://www.w3.org/Protocols/rfc2616/rfc2616-sec8.html) (and
>> [HTTPbis](http://tools.ietf.org/html/draft-ietf-httpbis-p1-messaging-
>> 25#section-6.3.2)) and [multiplexing in HTTP
>> 2.0](http://tools.ietf.org/html/draft-ietf-httpbis-http2-09#section-2.2). These
>> enable multiple requests to be passed, and responded to, over the same
>> persistent connection. But they do not enable the server to deliver content for
>> predicted requests.*
>>>
>>>> *Note: The `rel=prefetch` link relation prompts the browser to access
>> additional HTML pages that have not yet been requested by the user, but again
>> this is a different facility; there is no such thing as prefetching stylesheets,
>> scripts or images, as these are already required for the page by the time the
>> browser knows about them.*
>>>
>>>> *Note: given that browsers can typically have more than one connection
>> open to a website, and download files in parallel, is there an argument for
>> supporting multiple packages associated with a given page?*
>>>
>>>
>>> ### Self-Contained Content
>>>
>>> It is sometimes useful to distribute self-contained packages of material.
>> Examples include [Packaged Web Apps](http://www.w3.org/TR/widgets/) and
>> packages of CSV files used within the [Simple Data
>> Format](http://dataprotocols.org/simple-data-format/) or the [DataSet
>> Publishing Language](https://developers.google.com/public-data/). These
>> packages typically include a manifest file, in a machine-readable format (text,
>> JSON or XML), that provides further details and metadata about the files in
>> the package.
>>>
>>> ## Packaging Format
>>>
>>>> *Note: For the rationale for selecting this over other packaging formats, see
>> 'Rejected Approaches' below.*
>>>
>>> The TAG recommends using [multipart media
>> types](http://tools.ietf.org/html/rfc2046#section-5.1) for packaging materials
>> on the web. Specifically we recommend the registration of a new
>> `multipart/package` media type and the registration of a new `+package`
>> structured syntax suffix, per [RFC
>> 6838](http://tools.ietf.org/html/rfc6838#section-6).
>>>
>>> Multipart media types are defined in [RFC
>> 2046](http://tools.ietf.org/html/rfc2046#section-5.1). They can contain one or
>> more *body parts*, each comprising:
>>>
>>>    * a *boundary*
>>>    * a number of *header fields*
>>>    * an empty line
>>>    * the *content* of the file
>>>
>>>> *Ed note: From what I can tell it's possible for the header fields within a
>> multipart response to be anything so long as they follow the normal `Header:
>> Value` syntax used in MIME and HTTP. So this could be a very flexible
>> mechanism for adding metadata.*
>>>
>>> The `multipart/mixed` media type places no constraints on which header
>> fields are specified within the multipart file (see the definition of `MIME-part-
>> headers` in [RFC 2045](http://tools.ietf.org/html/rfc2045#section-3)).
>>>
>>> The only difference between `multipart/mixed` and `multipart/package` is
>> that in a `multipart/package` file every body part must include a `Content-Location`
>> header. Further, the values of the `Content-Location` header must all be
>> [absolute-path-relative URLs](http://url.spec.whatwg.org/#concept-absolute-
>> path-relative-url) or [path-relative
>> URLs](http://url.spec.whatwg.org/#concept-path-relative-url). (See Security
>> Considerations later.)
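>>>
>>> As an illustration, a small `multipart/package` file might look something like
>> the following (assuming a `boundary` parameter of `package-boundary` given in the
>> `Content-Type` header; the file names and contents here are invented):
>>>
>>>      --package-boundary
>>>      Content-Location: /index.html
>>>      Content-Type: text/html
>>>
>>>      <!DOCTYPE html>
>>>      <html> ... </html>
>>>      --package-boundary
>>>      Content-Location: /styles/site.css
>>>      Content-Type: text/css
>>>
>>>      body { font-family: sans-serif; }
>>>      --package-boundary--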
>>>
>>> ### Fragment Identifiers for Packages
>>>
>>> A basic fragment identifier scheme for the `multipart/package` media type is
>> of the form `file=url`, where *url* is resolved against the base URL of the
>> package: the part of *url* before any fragment identifies a file within the
>> package via its `Content-Location` part header, and any fragment in *url*
>> identifies a fragment within that file according to that file's media type
>> (as given in its `Content-Type` header).
>>>
>>> For example, the URL
>>>
>>>      http://example.org/path/to/package.pack#file=/home.html%23section1
>>>
>>> refers to the file within `http://example.org/path/to/package.pack` whose
>> `Content-Location` is
>>>
>>>      http://example.org/home.html
>>>
>>> and more specifically the element with the `id` `section1` within that file. This
>> should be the same as `http://example.org/home.html#section1`.
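>>>
>>> A minimal sketch of how a client might interpret this fragment identifier
>> scheme, using the standard URL API (the function name is illustrative and
>> error handling is omitted):
>>>
>>>      // Split a package URL's fragment into the member file it identifies
>>>      // (resolved against the package URL) and the fragment to apply within
>>>      // that member file.
>>>      function resolvePackageFragment(packageUrl: string) {
>>>        const hash = new URL(packageUrl).hash;   // "#file=/home.html%23section1"
>>>        if (!hash.startsWith("#file=")) return null;
>>>
>>>        const inner = decodeURIComponent(hash.slice("#file=".length));
>>>        const member = new URL(inner, packageUrl); // resolved against the package
>>>
>>>        const fragment = member.hash ? member.hash.slice(1) : null;
>>>        member.hash = "";
>>>        return { file: member.href, fragment };
>>>      }
>>>
>>>      // resolvePackageFragment(
>>>      //   "http://example.org/path/to/package.pack#file=/home.html%23section1")
>>>      //   => { file: "http://example.org/home.html", fragment: "section1" }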
>>>
>>> In general, links should be made directly to files on the web rather than to
>> files within packages. The particular package(s) that a file appears in is an
>> ephemeral phenomenon and not suitable for inclusion in a URL.
>>>
>>> ### `+package` Structured Suffix
>>>
>>> The `+package` structured suffix should be used on other multipart media
>> types that are used for more specialised packages. This is particularly useful for
>> package formats that must contain manifest files in particular formats; these
>> should use the `+package` structured suffix in their media type.
>>>
>>> For example, a `multipart/widget+package` media type could specify that a
>> web application package must contain a [`config.xml` configuration
>> document](http://www.w3.org/TR/widgets/#configuration-document) using
>> the `http://www.w3.org/ns/widgets` XML vocabulary as the first file within the
>> package, and that all other files in the package must be listed within this
>> configuration file or ignored.
>>>
>>> Clients can treat any file with an unrecognised `+package` media type as if it
>> were a `multipart/package` file.
>>>
>>> ## Requesting a Package
>>>
>>> Packages live on the web just like any other file. Thus it is perfectly possible
>> to request a package directly. For example:
>>>
>>>      GET /path/to/package.pack HTTP/1.1
>>>      Accept: multipart/package,multipart/*,*/*
>>>
>>> should result in a response like:
>>>
>>>      HTTP/1.1 200 OK
>>>      Content-Type: multipart/package;boundary=package-boundary
>>>
>>>      ... package content ...
>>>
>>> This satisfies the second of the requirements described above, namely the
>> easy distribution of self-contained content. Note that the `boundary`
>> parameter is required for multipart media types as defined in RFC 2046.
>>>
>>> To locate a package of representations of related resources, to support the
>> efficient delivery of scripts, stylesheets, images and so on over the web, we
>> recommend the use of a new `package` link relation. This can be used within a
>> `<link>` element in an HTML document:
>>>
>>>      <link rel="package" href="/path/to/package.pack">
>>>
>>> When the package is not HTML-based (for example if it is described by a
>> metadata file in JSON or XML), the `package` link relation can be used
>> within a `Link` header:
>>>
>>>      Link: </path/to/package.pack>; rel="package"
>>>
>>> ### Processing Packaged Content
>>>
>>> Clients that receive packaged content should unpackage it by splitting the
>> package on the boundary indicated in the media type. If there is no
>> `Content-Type` header, or no `boundary` parameter on the given content type,
>> then clients may recover by inferring the boundary from the packaged content
>> itself.
>>>
>>>> *Note: I guess that the `boundary` parameter is required because there
>> were implementations at the time of standardisation that didn't start with `--
>> boundary`. Now it's a real pain as it means multipart files can't be self-
>> contained.*
>>>
>>> The `Content-Location`s associated with each packaged file must be resolved
>> relative to the location of the package. If this results in a location that has a
>> different origin from the package, the file must be ignored.
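>>>
>>> A much-simplified sketch of this processing, assuming the package has been
>> fetched as text and the boundary taken from its `Content-Type`; the names are
>> illustrative, and real MIME parsing (CRLF handling, encodings and so on) needs
>> rather more care:
>>>
>>>      interface PackagedFile {
>>>        location: string;              // resolved against the package URL
>>>        headers: Map<string, string>;  // part header fields
>>>        body: string;                  // part content
>>>      }
>>>
>>>      function unpack(packageUrl: string, raw: string, boundary: string): PackagedFile[] {
>>>        const files: PackagedFile[] = [];
>>>        const packageOrigin = new URL(packageUrl).origin;
>>>
>>>        // Body parts are delimited by "--boundary"; "--boundary--" closes the package.
>>>        for (const part of raw.split("--" + boundary)) {
>>>          const trimmed = part.trim();
>>>          if (trimmed === "" || trimmed === "--") continue;
>>>
>>>          // Part headers are separated from the part content by an empty line.
>>>          const sep = trimmed.indexOf("\n\n");
>>>          const headerBlock = sep === -1 ? trimmed : trimmed.slice(0, sep);
>>>          const body = sep === -1 ? "" : trimmed.slice(sep + 2);
>>>
>>>          const headers = new Map<string, string>();
>>>          for (const line of headerBlock.split("\n")) {
>>>            const colon = line.indexOf(":");
>>>            if (colon > 0) {
>>>              headers.set(line.slice(0, colon).trim().toLowerCase(),
>>>                          line.slice(colon + 1).trim());
>>>            }
>>>          }
>>>
>>>          // Every part in a multipart/package must carry a Content-Location.
>>>          const contentLocation = headers.get("content-location");
>>>          if (!contentLocation) continue;
>>>
>>>          // Resolve against the package and enforce the same-origin restriction.
>>>          const resolved = new URL(contentLocation, packageUrl);
>>>          if (resolved.origin !== packageOrigin) continue;
>>>
>>>          files.push({ location: resolved.href, headers, body });
>>>        }
>>>        return files;
>>>      }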
>>>
>>> For performance, it is good practice for the first file in the package to be the
>> "root" of the package, referencing all the other files, but this cannot be relied
>> upon by the client as the same package may be used for multiple resources.
>>>
>>> Other files may be cached for later use by the client, with headers set as
>> appropriate based on those provided within the package and within the
>> response to the original request.
>>>
>>> ## Security Implications
>>>
>>> When used via the `package` link relation, the goal of a package is to populate a
>> client's cache and prevent it from making additional unnecessary requests.
>> Content from the cache will run with a base URL supplied within the package. If
>> the `Content-Location`s given for the files in a package weren't restricted to the
>> origin that the package is hosted on, `evil.com` could deliver content that it
>> claimed came from `bank.com`; that content would then be interpreted as if it did
>> indeed come from `bank.com`, but would run scripts inserted by `evil.com`.
>>>
>>> For the same reason, publishers should be careful about the packages that
>> they provide on their site, in just the same way as they should avoid hosting
>> untrusted HTML.
>>>
>>> ## Rejected Approaches
>>>
>>> Other approaches to packaging have been used and were considered by the
>> TAG.
>>>
>>> ### Zip as a Packaging Format
>>>
>>> The TAG discussed the use of zipped files as a packaging format. The main
>> problem with using zips is that the *central directory record*, which lists the
>> valid files within the zip archive, appears at the *end* of the zip.
>> Implementations therefore need to wait until the whole zip is downloaded
>> before the files within it can be read. This makes it unsuitable for a packaging
>> format for efficient delivery of content on the web (the first of the
>> requirements described above).
>>>
>>> A secondary problem with zip as a packaging format is that while there are
>> mechanisms for supplying additional information about individual files within
>> the package (through *extra fields*), they are not sufficient for extended
>> metadata. Each extra field consists of a 2-byte ID code, a 2-byte data size and
>> the data itself. The lists of valid core and extended ID codes are provided in
>> sections 4.5 and 4.6 of the [zip
>> definition](http://www.pkware.com/documents/casestudies/APPNOTE.TXT). The
>> extra field area within each file header is itself limited to 64KB in size.
>>>
>>> These limitations have resulted in people who use zip as a packaging format
>> providing separate manifest files within the zip.
>>>
>>> ### Other Packaging Formats
>>>
>>> #### Mozilla Archive Format
>>>
>>> The [Mozilla Archive Format](http://maf.mozdev.org/maff-
>> specification.html) is a zip-based packaging format for web content which uses
>> an RDF/XML manifest file within the zip to provide additional information about
>> the content. This approach has the drawbacks described in the previous
>> section, particularly lack of streamability.
>>>
>>> #### MHTML
>>>
>>> [RFC 2557](http://tools.ietf.org/html/rfc2557) defines MIME Encapsulation of
>> Aggregate Documents, such as HTML (MHTML). This uses the
>> `multipart/related` media type, with the first file in the package being the
>> packaged HTML document and the remainder being related resources.
>>>
>>> This is not a suitable general format for publishing packages on the web as it is
>> designed around an HTML page being the primary starting point for a package,
>> which is not true in all circumstances.
>>>
>>> #### Webarchive
>>>
>>> The [Webarchive](http://en.wikipedia.org/wiki/Webarchive) format uses
>> `application/x-webarchive` as a media type. It is a proprietary format defined
>> by Apple and used within Safari. There is very little information available about
>> its internal structure.
>>>
>>> #### WARC
>>>
>>> The
>> [WARC](http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml)
>> format is used for archiving web content. Although it provides for packaging,
>> and metadata for the files within the package, it is designed for archiving and is
>> fairly heavyweight for the packages that are under discussion here, requiring
>> `WARC-Record-ID`, `Content-Length`, `WARC-Date` and `WARC-Type` headers.
>>>
>>> ### Package Requests
>>>
>>> An approach to requesting packages that we considered would be to include
>> a new `Package: true` header in HTTP requests for normal files on a web server.
>> Servers that understand the `Package` header could then respond with a new
>> `2XX Packaged Content` success response whose body is a package that
>> includes a representation of the requested resource, along with
>> representations of any other related resources.
>>>
>>> For example, a client that understood packaging would send a request like:
>>>
>>>      GET /home.html HTTP/1.1
>>>      Accept:
>> text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
>>>      Package: true
>>>
>>> The `Package` header with the value `true` would indicate that the server
>> should attempt to respond with a package that includes the requested
>> resource. If the server is a legacy server that does not understand the
>> `Package` header, or if the server understands the `Package` header but does
>> not have a suitable package with which it can respond, it will respond as normal
>> to this request:
>>>
>>>      HTTP/1.1 200 OK
>>>      Content-Type: text/html
>>>
>>>      ... content of /home.html ...
>>>
>>> If the server understands the `Package` header and can respond with a
>> package that contains the requested representation (`/home.html`) then it
>> should respond with a `2XX Packaged Content` response. The `Content-
>> Location` header in this response would indicate the location of the package:
>>>
>>>      HTTP/1.1 2XX Packaged Content
>>>      Content-Type: multipart/package;boundary=package-boundary
>>>      Content-Location: /path/to/package.pack
>>>
>>>      ... content of /path/to/package.pack ...
>>>
>>> A `2XX Packaged Content` response would indicate that the server is
>> responding with a package that includes the same representation for the
>> requested resource as would have been provided, with a `200 OK` response, if
>> the `Package` header had not been present in the request.
>>>
>>> The problem with this approach is that it requires some fairly large changes to
>> HTTP: a new HTTP header and a new HTTP status code. These are complicated
>> to implement both in terms of specification and in terms of getting servers and
>> clients to support them. New status codes in particular are difficult to plug in to
>> popular web servers such as Apache. Using a non-standard status code also
>> requires configuration access to servers, which isn't possible in many publishing
>> environments.
>>>
>>> ### Specialising URLs
>>>
>>> The TAG [investigated the use of a special URL
>> syntax](https://gist.github.com/wycats/220039304b053b3eedd0) that would
>> enable package-aware clients to work with packages whilst legacy clients and
>> servers work with individual files. This approach is designed to meet the
>> requirement that someone could use it on a file-system-based web server
>> without access to any configuration options. In other words, it does not require
>> servers to be package aware.
>>>
>>>> **Note: This requirement also entails using a self-contained packaging
>> format. The multipart format described above is not self-contained because it
>> requires the `boundary` parameter to be set via a `Content-Type` header. Zip
>> packages or multipart packages nested inside `message/http` documents are
>> alternative self-contained packaging formats.**
>>>
>>> For example, we explored using:
>>>
>>>      http://example.com/path/to/package.pack!/home.html#section1
>>>
>>> to indicate the anchor `section1` within the file `home.html` within the
>> package `/path/to/package.pack`. The sequence `!/` is proposed as a unique
>> separator between the package location and the location of the target file
>> within the package.
>>>
>>> If someone wanted to provide packages for their files, they would structure
>> their URL space so that it looked like:
>>>
>>>      path/
>>>        to/
>>>          package.pack
>>>          package.pack!/
>>>            home.html
>>>
>>> A package-aware client would recognise that the URL
>> `http://example.com/path/to/package.pack!/home.html#section1` contained
>> the package separator `!/`. Instead of directly requesting the file
>> `http://example.com/path/to/package.pack!/home.html` as a legacy client
>> would, it would request the file `http://example.com/path/to/package.pack`,
>> unpack the package, use the contents of the package to populate its cache, and
>> then navigate to `http://example.com/path/to/package.pack!/home.html`,
>> which would then be within the cached content.
>>>
>>> The separator `!/` is designed such that it is unlikely to appear in existing URLs
>> [TODO: some analysis on whether this is actually the case]. It is also designed to
>> enable relative links to work. If there is a link within `home.html` to `faq.html` in
>> the same package, you would want to write within the page simply:
>>>
>>>      <a href="faq.html">FAQ</a>
>>>
>>> With a base URL of
>> `http://example.com/path/to/package.pack!/home.html#section1` such a link
>> would resolve to `http://example.com/path/to/package.pack!/faq.html`.
>> Similarly, links that started with `.` or `..` would continue to resolve as expected;
>> the package works exactly like a directory.
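>>>
>>> A quick check with the standard URL parser shows that this falls out of
>> normal relative resolution (the URLs are those from the example above):
>>>
>>>      const base = "http://example.com/path/to/package.pack!/home.html#section1";
>>>
>>>      new URL("faq.html", base).href;
>>>      //  => "http://example.com/path/to/package.pack!/faq.html"
>>>
>>>      new URL("../other.html", base).href;
>>>      //  => "http://example.com/path/to/other.html"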
>>>
>>> This approach could be effectively polyfilled using [Service
>> Workers](https://github.com/slightlyoff/ServiceWorker/blob/master/explaine
>> r.md). The Service Worker would intercept two types of requests:
>>>
>>>    1. requests that include `!/` would be mapped into requests for the package;
>> the resulting package would be used to populate a content cache containing
>> the unpacked package
>>>    2. further requests for pages that are controlled by the Service Worker
>> would be fulfilled from the populated content cache where packaged content
>> has been provided
>>>
>>> Implementation through Service Worker enables sites to use this packaging
>> method without any cross-site standardisation effort.
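>>>
>>> One possible shape of such a Service Worker is sketched below. The
>> `unpackage()` helper (turning a `multipart/package` response into URL/response
>> pairs, along the lines described earlier) is assumed rather than shown;
>> everything else uses the standard Cache API:
>>>
>>>      // Assumed helper: unpack a multipart/package response into member
>>>      // URLs (of the form ".../package.pack!/file") and their responses.
>>>      declare function unpackage(packageUrl: string,
>>>                                 response: Response): Promise<Map<string, Response>>;
>>>
>>>      const PACKAGE_CACHE = "packaged-content";
>>>
>>>      self.addEventListener("fetch", (event: any) => {
>>>        const url: string = event.request.url;
>>>        const sep = url.indexOf("!/");
>>>
>>>        if (sep !== -1) {
>>>          // 1. A packaged URL: fetch the package itself, populate the cache
>>>          //    from its contents, then answer with the requested member.
>>>          const packageUrl = url.slice(0, sep);
>>>          event.respondWith((async () => {
>>>            const cache = await caches.open(PACKAGE_CACHE);
>>>            const packageResponse = await fetch(packageUrl);
>>>            for (const [memberUrl, memberResponse] of await unpackage(packageUrl, packageResponse)) {
>>>              await cache.put(memberUrl, memberResponse);
>>>            }
>>>            return (await cache.match(event.request)) || fetch(event.request);
>>>          })());
>>>        } else {
>>>          // 2. Any other controlled request: answer from the populated cache
>>>          //    when packaged content has been provided, otherwise hit the network.
>>>          event.respondWith(
>>>            caches.match(event.request).then((cached: Response | undefined) =>
>>>              cached || fetch(event.request))
>>>          );
>>>        }
>>>      });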
>>>
>>> The biggest architectural problem with standardising this approach is that it
>> places additional constraints on URL spaces, at least for items for which a
>> package should be downloaded. As detailed in the [Internet Draft
>> Standardising Structure in URIs](http://tools.ietf.org/html/draft-ietf-appsawg-
>> uri-get-off-my-lawn-00), there are risks when defining new standard internal
>> structures within URLs:
>>>
>>>    * **collisions**: the suggested convention of `!/` may clash with URL
>> conventions used on other systems that have different best practices for URL
>> structures
>>>    * **dilution**: the arrangement of files into packages is ephemeral
>> information and does not reflect the semantic content of the files; it is bad
>> practice to include ephemeral information in URLs as it makes those URLs likely
>> to change, and therefore links to break
>>>    * **brittleness**: baking a particular new URL structure into the web is a
>> far-reaching change that will be hard to undo in the future
>>>    * **operational difficulty**: creating URLs containing `!/` may be difficult in
>> some systems, for example where it is hard to create directories that contain
>> the `!` character
>>>    * **client assumptions**: there may be existing URLs that contain the
>> package delimiter (eg `!/`) that would break with new package-aware clients
>>>
>>> The issues of dilution and operational difficulty are particularly apparent
>> when considering a file that should appear in multiple packages. The person
>> managing the server would have to ensure it's duplicated whenever it's
>> updated; those referencing the file would have to choose which instance of
>> the file to reference depending on which other files should be packaged with
>> it.
>>>
>>> ### Content Negotiation
>>>
>>> The TAG explored the use of content negotiation to retrieve a package of
>> resources. In this scenario, a client that understood packages would include
>> `multipart/package` as the most-favoured type of response:
>>>
>>>      GET /home.html HTTP/1.1
>>>      Accept:
>> multipart/package,text/html;q=0.95,application/xhtml+xml;q=0.95,application/
>> xml;q=0.9,image/webp,*/*;q=0.8
>>>
>>> A server that had a package containing `/home.html` would respond with that
>> package:
>>>
>>>      HTTP/1.1 200 OK
>>>      Content-Type: multipart/package;boundary=boundary-in-home-
>> package.pack
>>>      Content-Location: /home-package.pack
>>>
>>> There are three potential problems with this approach.
>>>
>>> First, a package that contains `/home.html` is arguably not a representation of
>> the resource `/home.html`, only a container for such a representation.
>>>
>>>> *Note: It's not clear whether the fact that there's a mismatch in semantics
>> actually has any implementation impact.*
>>>
>>> Second, the server would still need to use the rest of the `Accept` header to
>> determine what to include within the package, or indeed whether a package
>> can be created at all for the resource. For example, if the request had an
>> `Accept` header of:
>>>
>>>      Accept: multipart/package,text/json;q=0.9
>>>
>>> then we would like the server to respond with a package that contained the
>> `text/json` representation of the requested resource, or to give a `406 Not
>> Acceptable` response if there was no such package. This ability to dig into the
>> remaining part of the `Accept` header to determine a response would require
>> revising the way in which the `Accept` header works, which we can't do.
>>>
>>> Third, there would be no mechanism to differentiate between requesting a
>> package directly and requesting a package that contains a packaged resource.
>> For example, say that CSV and metadata were packaged together into
>> `multipart/package` files like `http://example.com/data.pack`. It would not be
>> clear from a request like:
>>>
>>>      GET /data.pack HTTP/1.1
>>>      Accept: multipart/package
>>>
>>> whether the request was directly for `/data.pack` or for a package that
>> contained `/data.pack`.
>>> --
>>> Jeni Tennison
>>> http://www.jenitennison.com/
>>>
>
>
>

Received on Wednesday, 22 January 2014 10:29:17 UTC