- From: Stian Soiland-Reyes <soiland-reyes@manchester.ac.uk>
- Date: Wed, 12 Feb 2020 14:39:26 +0000
- To: Erich Bremer <erich@ebremer.com>
- Cc: public-rosc@w3.org
On Tue, 11 Feb 2020 15:24:33 -0500, Erich Bremer <erich@ebremer.com> wrote:
> In storing the files in the RO Crate zip file, is there a preference for
> RDF property for representing an MD5 or SHA-512 hash of the file that is
> being stored in the RO crate zip file? There didn't seem to be one at
> schema.org but I did find the following:
> http://id.loc.gov/vocabulary/preservation/cryptographicHashFunctions.html
We have not listed an RDF property for hash within the RO-Crate manifest,
as we largely considered that a "transport-level" detail that is better
covered by BagIt or Oxford Common File Layout.
https://w3id.org/ro/crate/1.0#combining-with-other-packaging-schemes
There you should probably use SHA-256 or SHA-512, which are
cryptographically strong; MD5 and SHA-1 should be avoided where
possible.
I agree that the LOC vocabulary you link to provides good identifiers for
the hash *functions*, but it does not provide RDF properties for linking
to a particular hash.
I guess you *could* re-purpose URIs like
<http://id.loc.gov/vocabulary/preservation/cryptographicHashFunctions/sha256>
as a property, but then why didn't LOC declare them also as such, given
that they have other vocabularies?
We should not use it blindly as a property without agreeing what should
be a valid subject and object for its use - e.g. would these theoretical
properties expect the hash value as bytes, a hex string (with or without
spaces? Upper case, lower case or both?), or as a separate HashValue
resource?
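To illustrate the ambiguity, here is a minimal Python sketch showing one
SHA-256 digest in several textual forms. The sample input and the names
of the forms are only for illustration; none of this is defined by
RO-Crate or by the LOC vocabulary.

```python
import base64
import hashlib

# Arbitrary sample input, chosen only for illustration.
digest = hashlib.sha256(b"hello\n").digest()

# The same 32 bytes, serialised four different ways.
forms = {
    "hex_lower": digest.hex(),
    "hex_upper": digest.hex().upper(),
    "hex_spaced": digest.hex(" ", 4),  # separator argument needs Python 3.8+
    "base64": base64.b64encode(digest).decode("ascii"),
}

for name, text in forms.items():
    print(f"{name}: {text}")
```

All four strings denote the same hash, yet a naive string comparison
treats them as different values, which is why the expected form needs to
be agreed on first.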
One possibility, if you can avoid insecure MD5 and SHA-1, is to
use RFC6920 nih: URIs as identifiers https://tools.ietf.org/html/rfc6920
(or the shorter ni: form, which uses base64url encoding),
for instance:
{ "@context": "https://w3id.org/ro/crate/1.0/context",
  "@graph": [
    {
      "@type": "CreativeWork",
      "@id": "ro-crate-metadata.jsonld",
      "conformsTo": {"@id": "https://w3id.org/ro/crate/1.0"},
      "about": {"@id": "./"},
      "description": "RO-Crate Metadata File Descriptor (this file)"
    },
    {
      "@id": "./",
      "@type": "Dataset",
      "name": "Example RO-Crate",
      "description": "The RO-Crate Root Data Entity",
      "hasPart": [
        {"@id": "data1.txt"}
      ]
    },
    {
      "@id": "data1.txt",
      "@type": "File",
      "description": "One of hopefully many Data Entities",
      "identifier": { "@id": "nih:sha-256;5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03"}
    }
  ]
}
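As a rough sketch, such an identifier could be computed in Python as
follows. The function name nih_sha256 is hypothetical, and this emits
only the plain-hex form without the optional check digit that RFC6920
also allows.

```python
import hashlib


def nih_sha256(data: bytes) -> str:
    """Return an RFC6920 nih: URI (plain lower-case hex, no check digit)."""
    return "nih:sha-256;" + hashlib.sha256(data).hexdigest()


# A file containing only the line "hello" (with a trailing newline)
print(nih_sha256(b"hello\n"))
# -> nih:sha-256;5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03
```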
An advantage of NI URIs is that they can be rewritten to .well-known HTTP
URIs for retrieval. You can then retrieve the content from any supporting
content-delivery platform, because you can check the hash afterwards:
https://tools.ietf.org/html/rfc6920#section-4
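A minimal Python sketch of that rewriting, under some assumptions: the
function names ni_sha256 and well_known_url are hypothetical,
example.com is a placeholder authority, https is my choice of default
scheme, and query parameters on the ni: URI are not handled.

```python
import base64
import hashlib


def ni_sha256(data: bytes, authority: str = "") -> str:
    """Return an RFC6920 ni: URI using unpadded base64url encoding."""
    digest = hashlib.sha256(data).digest()
    value = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"ni://{authority}/sha-256;{value}"


def well_known_url(ni_uri: str, scheme: str = "https") -> str:
    """Map ni://authority/alg;value to scheme://authority/.well-known/ni/alg/value."""
    prefix = "ni://"
    if not ni_uri.startswith(prefix):
        raise ValueError("not an ni: URI")
    authority, _, rest = ni_uri[len(prefix):].partition("/")
    if not authority:
        raise ValueError("an authority is needed to build a retrieval URL")
    alg, sep, value = rest.partition(";")
    if not sep or not alg or not value:
        raise ValueError("expected alg;value after the authority")
    return f"{scheme}://{authority}/.well-known/ni/{alg}/{value}"


uri = ni_sha256(b"Hello World!", authority="example.com")
print(uri)
print(well_known_url(uri))
```

After fetching from the resulting URL, the client would hash the
response body again and compare it with the value in the ni: URI before
trusting the content.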
For RO-Crate, a more elaborate alternative, where you won't need to
parse the URI, is to use a https://schema.org/PropertyValue similar to
https://w3id.org/ro/crate/1.0/#repository-specific-identifiers
but linking to the id.loc identifiers:
{
  "@id": "data1.txt",
  "@type": "File",
  "description": "A file containing only the line: hello",
  "identifier": { "@id": "nih:sha-256;5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03"}
},
{ "@id": "nih:sha-256;5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03",
  "@type": "PropertyValue",
  "name": "sha256",
  "unitText": "hexadecimal",
  "propertyID": { "@id": "http://id.loc.gov/vocabulary/preservation/cryptographicHashFunctions/sha256"},
  "value": "5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03"
}
Here I used the more lightweight "propertyID" and "hexadecimal" as
unitText, so it would be more of a convention. The RFC6920 URIs are far
more rigorously defined and thus my preference, but at the expense of
parsing.
In this combined approach you get the best of both worlds - you have a
global content-based @id URI for the data file (content), and you have
exposed the sha256 hash value as a separate property so you don't need
to parse that URI.
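The combined approach could be sketched in Python roughly as follows.
The helper names hash_entities and verify_file are hypothetical and not
part of any RO-Crate library; the "hexadecimal" unitText is only the
convention suggested above, and the lower-casing in the comparison is my
own assumption about how lenient a consumer should be.

```python
import hashlib

LOC_SHA256 = ("http://id.loc.gov/vocabulary/preservation/"
              "cryptographicHashFunctions/sha256")


def hash_entities(file_id: str, data: bytes) -> list:
    """Build the File entity and its PropertyValue hash entity."""
    hex_digest = hashlib.sha256(data).hexdigest()
    nih = f"nih:sha-256;{hex_digest}"
    return [
        {"@id": file_id, "@type": "File", "identifier": {"@id": nih}},
        {
            "@id": nih,
            "@type": "PropertyValue",
            "name": "sha256",
            "unitText": "hexadecimal",
            "propertyID": {"@id": LOC_SHA256},
            "value": hex_digest,
        },
    ]


def verify_file(data: bytes, graph: list, file_id: str) -> bool:
    """Check data against the hash value linked from the File entity."""
    by_id = {entity["@id"]: entity for entity in graph}
    hash_id = by_id[file_id]["identifier"]["@id"]
    expected = by_id[hash_id]["value"]  # no need to parse the nih: URI
    return hashlib.sha256(data).hexdigest() == expected.lower()


graph = hash_entities("data1.txt", b"hello\n")
print(verify_file(b"hello\n", graph, "data1.txt"))
```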
BTW, the RO-Crate community don't usually communicate on this list (but
perhaps we should), could you raise this as a Use Case on
https://github.com/researchobject/ro-crate/issues ?
Feel free to link to my reply!
We can then discuss it on the next RO-Crate telcon, which is
scheduled for 2020-02-27
<https://s.apache.org/ro-crate-minutes>
It might also be worth asking on the schema.org list as there might be
others there dealing with hash values.
--
Stian Soiland-Reyes
The University of Manchester 🐝
https://www.esciencelab.org.uk/
https://orcid.org/0000-0001-9842-9718
Received on Wednesday, 12 February 2020 14:39:47 UTC