Re: Research Object Bundle 1.0 from Stian Soiland-Reyes on 2014-11-14 (public-linked-json@w3.org from November 2014)

From: Stian Soiland-Reyes <soiland-reyes@cs.manchester.ac.uk>
Date: Fri, 14 Nov 2014 14:36:55 +0000
To: Hendy Irawan <ceefour666@gmail.com>
Cc: Linked JSON <public-linked-json@w3.org>
Message-ID: <CAPRnXtmKUHL4hcMXb8guTsJcZpv76Mbu4GwKJPyzeEMW=Vxjmw@mail.gmail.com>
Thanks for your interesting questions! My first thoughts:


You might not want to put the Parquet dataset directly into the ZIP
file. The largest cases I have tested without any issues using the
Java 7 APIs are 1M files of 1 kB each and a single big file of 5 GB,
but other libraries/languages/OSes have different limits. For
instance, opening a .zip in Windows Explorer will easily bail out at
just 250 MB or 2000 files.


But if you have a URI to the dataset (it doesn't have to be a
clickable HTTP type), as any S3 resource would, then that can be
aggregated as an external reference. You might want to include
additional provenance as the content of an external URI can change -
is the intention to bundle a dataset as-it-was or as-is-today for the
reader?

Let's say you know it will change, but you do not want to include a
snapshot of it. You did however download one for producing whatever
your RO Bundle is capturing. You can then use an absolute URI which
can't be resolved:

    { "aggregates": [
       { "uri": "urn:uuid:0734d1e9-28b9-4098-8838-22bf572beda2",
         "retrievedFrom": "https://mybucket.s3.amazonaws.com/0734d1e9-28b9-4098-8838-22bf572beda2",
         "retrievedOn": "2013-05-21T14:24:19Z" }
      ]
    }

You can then use urn:uuid:0734d1e9-28b9-4098-8838-22bf572beda2 as the
identifier for that Parquet dataset as it was back in May 2013,
without falsely implying something about the current content.
Alternatively you can do
file://mycomputer.example.com/tmp/0734d1e9.parquet - but even that
could be 'outside the Bundle' and easily subject to change. Bundling
file:/// URIs should be a clear no-no and probably listed as a
security concern.
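
A minimal Python sketch of minting such an entry, assuming the manifest
is handled as a plain dict. The helper name external_snapshot_entry is
hypothetical and not part of the Java or Ruby RO Bundle APIs:

```python
import json
import uuid
from datetime import datetime, timezone


def external_snapshot_entry(source_url, retrieved=None):
    """Build an "aggregates" entry for a snapshot that is not bundled.

    Hypothetical helper: mints a fresh urn:uuid identifier and records
    where and when the content was downloaded.
    """
    if retrieved is None:
        retrieved = datetime.now(timezone.utc)
    # Normalise to UTC so the trailing "Z" is accurate.
    retrieved = retrieved.astimezone(timezone.utc)
    return {
        "uri": uuid.uuid4().urn,  # a new urn:uuid:... identifier
        "retrievedFrom": source_url,
        "retrievedOn": retrieved.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


entry = external_snapshot_entry(
    "https://mybucket.s3.amazonaws.com/0734d1e9-28b9-4098-8838-22bf572beda2",
    datetime(2013, 5, 21, 14, 24, 19, tzinfo=timezone.utc),
)
print(json.dumps({"aggregates": [entry]}, indent=2))
```

The identifier is random, so it says nothing about where the content
lives today; only "retrievedFrom" and "retrievedOn" carry that history.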


We did not want to make a hard distinction between what is in or what
is out of the ZIP file - there might be many reasons for both. For
instance you might want to keep a log file behind an authenticated
HTTP server because it contains sensitive information about your
infrastructure or your medical subjects, or you might want to bundle
a massive satellite image inside the ZIP, because that is the main
reason you are making the bundle in the first place.



You can add any kind of blobs or images to the ZIP (or externally by
URI). You can also include things in the ZIP that are not considered
'aggregated' - although these would come without the promise of being
kept through a read/write round-trip unless they are related as an
annotation body.

If the picture is just like a preview of a model, then I would relate
it directly:

   { "aggregates": [
         { "uri": "/that/big/file.csv" },
         { "uri": "/preview/bigfile.png", "format": "image/png" }
      ],
      "annotations": [
        { "about": "/that/big/file.csv",
          "content": "/preview/bigfile.png" }
       ]
    }




If it is something more complicated, I would put that in a separate
annotation body which is about both resources (e.g. relating them):

   { "aggregates": [
         { "uri": "/that/big/file.csv" },
         { "uri": "/preview/bigfile.png", "format": "image/png" }
      ],
      "annotations": [
        { "about": ["/that/big/file.csv", "/preview/bigfile.png"],
          "content": "annotation/how-is-the-png-related-to-csv.jsonld" }
       ]
    }


.. and then detail it in the
annotation/how-is-the-png-related-to-csv.jsonld using JSON-LD and
whatever vocabulary is appropriate - perhaps a detailed PROV-O trace?




Sample data I would include as an annotation:

    { "annotations": [
       { "about": "/that/big/file.csv",
         "content": "annotation/first-lines.csv",
         "oa:motivatedBy": { "@id": "oa:highlighting" }
        }
     ]
    }

Here I added some additional JSON-LD from
http://www.openannotation.org/spec/core/core.html#Motivations -
minting a more specific motivation for 'sample' might be relevant (but
without implying biological samples etc..)
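
As a rough Python sketch of producing that sample file and its
annotation entry: the function names first_lines and sample_annotation
are hypothetical, and I am assuming the CSV has a header row which
should be kept in the sample:

```python
import csv
import io
from itertools import islice


def first_lines(csv_text, rows=3):
    """Return the header plus the first `rows` data rows as CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(islice(reader, rows + 1))  # +1 keeps the header row
    return out.getvalue()


def sample_annotation(about, content):
    """Annotation entry in the same shape as the manifest snippet above."""
    return {
        "about": about,
        "content": content,
        "oa:motivatedBy": {"@id": "oa:highlighting"},
    }


big_csv = "id,value\n1,a\n2,b\n3,c\n4,d\n5,e\n"
sample = first_lines(big_csv)
annotation = sample_annotation("/that/big/file.csv",
                               "annotation/first-lines.csv")
print(sample)
print(annotation)
```

The sample text would then be written to annotation/first-lines.csv
next to the manifest, and the entry appended to "annotations".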



You can use the "conformsTo" relation to describe the schema - we
added this for SBML models, but of course it can be used with any kind
of standard or schema the resource is conforming to:


    { "aggregates": [
       { "uri": "/my/file.xml",
         "conformsTo": "http://ns.taverna.org.uk/2008/xml/t2flow/t2flow.xsd",
         "createdOn": "2014-01-10T14:24:19Z" }
     ]
    }

If the 'conformsTo' is a retrievable file (like an XML schema), it
might be clever to bundle a copy of the schema - but I would not set
'conformsTo' to the bundled schema (as your intention was to conform
with the published standard, not to a particular file). So I would
then add a second item to "aggregates":

    { "uri": "schemas/t2flow.xsd",
      "retrievedFrom": "http://ns.taverna.org.uk/2008/xml/t2flow/t2flow.xsd",
      "retrievedOn": "2013-05-21T14:24:19Z"
    }

So just with this little provenance you should have enough to set
alarm bells ringing for a seasoned data scientist - hang on, he
created the XML file almost eight months after retrieving the schema..
was it still the same? :)
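
That kind of check is easy to automate. A small Python sketch, assuming
the two manifest entries above are available as dicts with whole-second
UTC timestamps; parse_ts and days_since_retrieval are hypothetical
names:

```python
from datetime import datetime


def parse_ts(value):
    """Parse a UTC timestamp such as 2014-01-10T14:24:19Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def days_since_retrieval(resource, bundled_copy):
    """Whole days from retrieving the bundled copy to creating the resource.

    A positive number means the resource was created after the copy was
    retrieved, so the published original may have changed in between.
    """
    created = parse_ts(resource["createdOn"])
    retrieved = parse_ts(bundled_copy["retrievedOn"])
    return (created - retrieved).days


resource = {
    "uri": "/my/file.xml",
    "conformsTo": "http://ns.taverna.org.uk/2008/xml/t2flow/t2flow.xsd",
    "createdOn": "2014-01-10T14:24:19Z",
}
schema_copy = {
    "uri": "schemas/t2flow.xsd",
    "retrievedFrom": "http://ns.taverna.org.uk/2008/xml/t2flow/t2flow.xsd",
    "retrievedOn": "2013-05-21T14:24:19Z",
}

gap = days_since_retrieval(resource, schema_copy)
if resource["conformsTo"] == schema_copy["retrievedFrom"] and gap > 0:
    print(f"Bundled schema was retrieved {gap} days before {resource['uri']} was created")
```

What counts as a worrying gap is a judgement call for the reader; the
sketch only surfaces the number.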

On 14 November 2014 12:45, Hendy Irawan <ceefour666@gmail.com> wrote:
> Thank you Stian :)
>
> Does it work with Parquet datasets ?
>
> Is there a mechanism to describe the schema? e.g. Avro datasets have a schema
> that can be included in the manifest.
>
> How to include "sample data" in the manifest? e.g. 3 first rows...
>
> How to link blobs/image attachments?
>
> Hendy
>
> On Fri, Nov 14, 2014 at 7:30 AM, Stian Soiland-Reyes
> <soiland-reyes@cs.manchester.ac.uk> wrote:
>>
>> ... and a teaser manifest in JSON-LD:
>>
>> {
>>     "@context":  ["https://w3id.org/bundle/context"],
>>     "id": "/",
>>     "manifest":  "manifest.json",
>>     "createdOn": "2013-03-05T17:29:03Z",
>>     "createdBy": {
>>         "uri":     "http://example.com/foaf#alice",
>>         "orcid":   "http://orcid.org/0000-0002-1825-0097",
>>         "name":    "Alice W. Land" },
>>     "aggregates": [
>>        { "uri":  "http://example.com/blog/great-results" },
>>        { "uri":      "/dataset/results.csv",
>>          "mediatype": "text/csv",
>>          "createdBy": {
>>              "uri":     "http://example.com/foaf#bob",
>>              "name":    "Bob Barnsworth" },
>>          "createdOn": "2013-02-12T19:37:32.939Z" }
>>     ],
>>     "annotations": [
>>       { "uri":     "urn:uuid:d67466b4-3aeb-4855-8203-90febe71abdf",
>>         "about":   "/dataset/results.csv",
>>         "content": "annotations/dataset-metadata.ttl" }
>>     ]
>> }
>>
>> Here, Alice has aggregated just two resources, a blog entry (external
>> URI) and /dataset/results.csv (bundled), a CSV file which was created
>> by Bob. There is an annotation with additional metadata about the CSV
>> file, stored in /.ro/annotations/dataset-metadata.ttl within the
>> bundle.
>>
>> On 14 November 2014 11:23, Stian Soiland-Reyes
>> <soiland-reyes@cs.manchester.ac.uk> wrote:
>> > I am proud to announce the updated Research Object Bundle 1.0, a
>> > researchobject.org specification:
>> >
>> >     https://w3id.org/bundle/2014-11-05/
>> >
>> >
>> > This specification defines RO Bundle, a ZIP-based file format that
>> > bundles resources which when aggregated form an identifiable
>> > conceptual work; say a collection of datasets resulting from a
>> > scientific experiment, or a gathering of logs and outputs from a
>> > particular command line execution.
>> >
>> > This specification is accompanied by two APIs for creating and
>> > managing RO Bundles:
>> >
>> > Java: https://github.com/wf4ever/robundle/
>> > Ruby: https://github.com/myGrid/ruby-ro-bundle
>> >
>> >
>> >
>> > The RO Bundle includes a manifest, aggregated resources (which might be
>> > included as files in the ZIP or as external URIs), their annotations
>> > and provenance for the purposes of exporting, archiving, publishing
>> > and transferring the Research Object as a whole.
>> >
>> > The structure of the ZIP-file is decided by the user and/or
>> > application, except for the reserved paths for the mediatype, JSON-LD
>> > manifest and annotations.
>> >
>> >
>> > RO Bundle relies on several existing RDF vocabularies:
>> >  * OAI-ORE - aggregation
>> >  * PROV - general provenance
>> >  * PAV - contributions and sources
>> >  * ORCID - identifying contributors
>> >  * FOAF - describing contributors
>> >  * OA - annotation on aggregated resources
>> >  * RO - research object model
>> >
>> >
>> > For further comments or suggestions, feel free to use the mailing list
>> > for the W3C ROSC community group - http://www.w3.org/community/rosc/
>> > or raise a Github issue/pull request at
>> >
>> > https://github.com/ResearchObject/specifications/tree/gh-pages/bundle/draft
>> >
>> > --
>> > Stian Soiland-Reyes, Manchester e-Science Lab
>> > School of Computer Science
>> > The University of Manchester
>> > http://soiland-reyes.com/stian/work/
>> > http://orcid.org/0000-0001-9842-9718
>>
>>
>>
>> --
>> Stian Soiland-Reyes, myGrid team
>> School of Computer Science
>> The University of Manchester
>> http://soiland-reyes.com/stian/work/ http://orcid.org/0000-0001-9842-9718
>>
>



-- 
Stian Soiland-Reyes, myGrid team
School of Computer Science
The University of Manchester
http://soiland-reyes.com/stian/work/ http://orcid.org/0000-0001-9842-9718
Received on Friday, 14 November 2014 14:37:44 UTC