- From: Stian Soiland-Reyes <soiland-reyes@cs.manchester.ac.uk>
- Date: Fri, 14 Nov 2014 14:36:55 +0000
- To: Hendy Irawan <ceefour666@gmail.com>
- Cc: Linked JSON <public-linked-json@w3.org>
Thanks for your interesting questions! My first thoughts:

You might not want to put the Parquet dataset directly into the ZIP file. The maximum sizes I have tested without any issues using the Java 7 APIs are 1M files of 1 kB each and a single big file of 5 GB, but other libraries, languages and operating systems have different limits. For instance, opening a .zip in Windows Explorer will easily bail out at just 250 MB or 2000 files.

But if you have a URI to the dataset (it doesn't have to be a clickable HTTP type), as any S3 resource would, then that can be aggregated as an external reference. You might want to include additional provenance, as the content of an external URI can change - is the intention to bundle a dataset as-it-was or as-is-today for the reader?

Let's say you know it will change, but you do not want to include a snapshot of it. You did however download one for producing whatever your RO Bundle is capturing. You can then use an absolute URI which you can't resolve:

    {
      "aggregates": [
        { "uri": "urn:uuid:0734d1e9-28b9-4098-8838-22bf572beda2",
          "retrievedFrom": "https://mybucket.s3.amazonaws.com/0734d1e9-28b9-4098-8838-22bf572beda2",
          "retrievedOn": "2013-05-21T14:24:19Z" }
      ]
    }

You can then use urn:uuid:0734d1e9-28b9-4098-8838-22bf572beda2 as the identifier for that Parquet dataset as it was back in May 2013, without falsely implying something about the current content.

Alternatively you can use file://mycomputer.example.com/tmp/0734d1e9.parquet - but even that could be 'outside the Bundle' and easily subject to change. Bundling file:/// URIs should be a clear no-no and probably listed as a security concern.

We did not want to make a hard distinction between what is in or what is out of the ZIP file - there might be many reasons for both.
For instance you might want to keep a log file behind an authenticated HTTP server because it contains sensitive information about your infrastructure or your medical subjects, or you might want to bundle a massive satellite image inside, because that is the main reason for you making the bundle in the first place.

You can add any kind of blobs or images to the ZIP (or externally by URI). You can also include things in the ZIP that are not considered 'aggregated' - although these would come without the promise of being kept through a read/write round-trip unless they are related as an annotation body.

If the picture is just a preview of a model, then I would relate it directly:

    {
      "aggregates": [
        { "uri": "/that/big/file.csv" },
        { "uri": "/preview/bigfile.png", "format": "image/png" }
      ],
      "annotations": [
        { "about": "/that/big/file.csv",
          "content": "/preview/bigfile.png" }
      ]
    }

If it is something more complicated I would put that in a separate annotation body which is about both resources (e.g. relating them):

    {
      "aggregates": [
        { "uri": "/that/big/file.csv" },
        { "uri": "/preview/bigfile.png", "format": "image/png" }
      ],
      "annotations": [
        { "about": ["/that/big/file.csv", "/preview/bigfile.png"],
          "content": "annotation/how-is-the-png-related-to-csv.jsonld" }
      ]
    }

.. and then detail it in the annotation/how-is-the-png-related-to-csv.jsonld using JSON-LD and whatever vocabulary is appropriate - perhaps a detailed PROV-O trace?

Sample data I would include as an annotation:

    {
      "annotations": [
        { "about": "/that/big/file.csv",
          "content": "annotation/first-lines.csv",
          "oa:motivatedBy": { "@id": "oa:highlighting" } }
      ]
    }

Here I added some additional JSON-LD from http://www.openannotation.org/spec/core/core.html#Motivations - minting a more specific motivation for 'sample' might be relevant (but without implying biological samples etc.)
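
As a rough illustration of the preview example above, here is a minimal Python sketch that builds such a manifest and writes it into a ZIP together with the two files. The helper names build_manifest and write_bundle are hypothetical and are not part of the Java or Ruby APIs; placing the manifest at .ro/manifest.json is an assumption based on the reserved .ro/ folder, and the reserved mediatype entry that a real RO Bundle needs is left out for brevity.

```python
import io
import json
import zipfile


def build_manifest(csv_path, preview_path):
    """Hypothetical helper: aggregate a CSV and its PNG preview, and
    relate the two with a single annotation, as in the example above."""
    return {
        "@context": ["https://w3id.org/bundle/context"],
        "id": "/",
        "aggregates": [
            {"uri": csv_path},
            {"uri": preview_path, "format": "image/png"},
        ],
        "annotations": [
            {"about": csv_path, "content": preview_path},
        ],
    }


def write_bundle(manifest, files):
    """Hypothetical helper: write manifest and files to an in-memory ZIP.

    `files` maps bundle paths (with a leading "/") to bytes.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # Assumption: the manifest lives at .ro/manifest.json
        zf.writestr(".ro/manifest.json", json.dumps(manifest, indent=2))
        for path, data in files.items():
            # ZIP entry names are relative, so drop the leading "/"
            zf.writestr(path.lstrip("/"), data)
    return buf.getvalue()


manifest = build_manifest("/that/big/file.csv", "/preview/bigfile.png")
bundle = write_bundle(manifest, {
    "/that/big/file.csv": b"id,value\n1,42\n",
    # Placeholder bytes only, not a real image
    "/preview/bigfile.png": b"\x89PNG\r\n\x1a\n",
})

# Read the ZIP back to check what was stored
with zipfile.ZipFile(io.BytesIO(bundle)) as zf:
    names = sorted(zf.namelist())
    roundtrip = json.loads(zf.read(".ro/manifest.json"))
```

Swapping the single "about" string for a list of both paths, with "content" pointing at a separate JSON-LD file, would give the more complicated variant.
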
You can use the "conformsTo" relation to describe the schema - we added this for SBML models, but of course it can be used with any kind of standard or schema the resource is conforming to:

    {
      "aggregates": [
        { "uri": "/my/file.xml",
          "conformsTo": "http://ns.taverna.org.uk/2008/xml/t2flow/t2flow.xsd",
          "createdOn": "2014-01-10T14:24:19Z" }
      ]
    }

If the 'conformsTo' is a retrievable file (like an XML schema), it might be clever to bundle a copy of the schema - but I would not set 'conformsTo' to the bundled schema (as your intention was to conform with the published standard, not to a particular file). So I would then add a second item to "aggregates":

    { "uri": "schemas/t2flow.xsd",
      "retrievedFrom": "http://ns.taverna.org.uk/2008/xml/t2flow/t2flow.xsd",
      "retrievedOn": "2013-05-21T14:24:19Z" }

So just with this little provenance you should have enough to set alarm bells going off for a seasoned data scientist - hang on, he created the XML file a year after retrieving the schema.. was it still the same? :)

On 14 November 2014 12:45, Hendy Irawan <ceefour666@gmail.com> wrote:
> Thank you Stian :)
>
> Does it work with Parquet datasets ?
>
> Is there a mechanism to describe the schema? e.g. Avro datasets has a schema
> that can be included in the manifest.
>
> How to include "sample data" in the manifest? e.g. 3 first rows...
>
> How to link blobs/image attachments?
>
> Hendy
>
> Hendy Irawan - on Twitter - on LinkedIn
> Web Developer | Bippo Indonesia | Akselerator Bisnis | Bandung
>
> On Fri, Nov 14, 2014 at 7:30 AM, Stian Soiland-Reyes
> <soiland-reyes@cs.manchester.ac.uk> wrote:
>>
>> ... and a teaser manifest in JSON-LD:
>>
>> {
>>   "@context": ["https://w3id.org/bundle/context"],
>>   "id": "/",
>>   "manifest": "manifest.json",
>>   "createdOn": "2013-03-05T17:29:03Z",
>>   "createdBy": {
>>     "uri": "http://example.com/foaf#alice",
>>     "orcid": "http://orcid.org/0000-0002-1825-0097",
>>     "name": "Alice W. Land" },
>>   "aggregates": [
>>     { "uri": "http://example.com/blog/great-results" },
>>     { "uri": "/dataset/results.csv",
>>       "mediatype": "text/csv",
>>       "createdBy": {
>>         "uri": "http://example.com/foaf#bob",
>>         "name": "Bob Barnsworth" },
>>       "createdOn": "2013-02-12T19:37:32.939Z" },
>>   ],
>>   "annotations": [
>>     { "uri": "urn:uuid:d67466b4-3aeb-4855-8203-90febe71abdf",
>>       "about": "/dataset/results.csv",
>>       "content": "annotations/dataset-metadata.ttl" },
>>   ]
>> }
>>
>> Here, Alice has aggregated just two resources, a blog entry (external
>> URI) and /dataset/results.csv (bundled), a CSV file which was created
>> by Bob. There is an annotation with additional metadata about the CSV
>> file, stored in /.ro/annotations/dataset-metadata.ttl within the
>> bundle.
>>
>> On 14 November 2014 11:23, Stian Soiland-Reyes
>> <soiland-reyes@cs.manchester.ac.uk> wrote:
>> > I am proud to announce the updated Research Object Bundle 1.0, a
>> > researchobject.org specification:
>> >
>> > https://w3id.org/bundle/2014-11-05/
>> >
>> >
>> > This specification defines RO Bundle, a ZIP-based file format that
>> > bundles resources which when aggregated form an identifiable
>> > conceptual work; say a collection of datasets resulting from a
>> > scientific experiment, or a gathering of logs and outputs from a
>> > particular command line execution.
>> >
>> > This specification is accompanied by two APIs for creating and
>> > managing RO Bundles:
>> >
>> > Java: https://github.com/wf4ever/robundle/
>> > Ruby: https://github.com/myGrid/ruby-ro-bundle
>> >
>> >
>> >
>> > The RO Bundle include a manifest, aggregated resources (which might be
>> > included as files in the ZIP or as external URIs), their annotations
>> > and provenance for the purposes of exporting, archiving, publishing
>> > and transferring the Research Object as a whole.
>> >
>> > The structure of the ZIP-file is decided by the user and/or
>> > application, except for the reserved paths for the mediatype, JSON-LD
>> > manifest and annotations.
>> >
>> >
>> > RO Bundle relies on several existing RDF vocabularies :
>> > * OAI-ORE - aggregation
>> > * PROV - general provenance
>> > * PAV - contributions and sources
>> > * ORCID - identifying contributors
>> > * FOAF - describing contributors
>> > * OA - annotation on aggregated resources
>> > * RO - research object model
>> >
>> >
>> > For further comments or suggestions, feel free to use the mailing list
>> > for the W3C ROSC community group - http://www.w3.org/community/rosc/
>> > or raise a Github issue/pull request at
>> >
>> > https://github.com/ResearchObject/specifications/tree/gh-pages/bundle/draft
>> >
>> > --
>> > Stian Soiland-Reyes, Manchester e-Science Lab
>> > School of Computer Science
>> > The University of Manchester
>> > http://soiland-reyes.com/stian/work/
>> > http://orcid.org/0000-0001-9842-9718
>>
>>
>>
>> --
>> Stian Soiland-Reyes, myGrid team
>> School of Computer Science
>> The University of Manchester
>> http://soiland-reyes.com/stian/work/ http://orcid.org/0000-0001-9842-9718
>>
>

--
Stian Soiland-Reyes, myGrid team
School of Computer Science
The University of Manchester
http://soiland-reyes.com/stian/work/ http://orcid.org/0000-0001-9842-9718
Received on Friday, 14 November 2014 14:37:44 UTC