Re: Cryptographic Hash Functions

On Tue, 11 Feb 2020 15:24:33 -0500, Erich Bremer <erich@ebremer.com> wrote:
> In storing the files in the RO Crate zip file, is there a preference for
> RDF property for representing an MD5 or SHA-512 hash of the file that is
> being stored in the RO crate zip file?  There didn't seem to be one at
> schema.org but I did find the following:
> http://id.loc.gov/vocabulary/preservation/cryptographicHashFunctions.html

We have not listed an RDF property for hash within the RO-Crate manifest,
as we largely considered that a "transport-level" detail that is better
covered by BagIt or Oxford Common File Layout.
https://w3id.org/ro/crate/1.0#combining-with-other-packaging-schemes

There you should probably use SHA-256 or SHA-512 so that it's
cryptographically strong; MD5 and SHA-1 should be avoided where
possible.


I agree that the LOC ontology you link to shows good identifiers for the
hash *functions*, but it does not provide RDF properties for linking to a
particular hash.

I guess you *could* re-purpose URIs like
<http://id.loc.gov/vocabulary/preservation/cryptographicHashFunctions/sha256>
as a property, but then why didn't LOC declare them also as such, given
that they have other vocabularies?


We should not use it blindly as a property without agreeing on what
should be a valid subject and object for its use - e.g. would these
theoretical properties expect the hash value as bytes, a hex string (with
or without spaces? Upper case, lower case or both?), or as a separate
HashValue resource?


One possibility, if you can avoid the insecure MD5 and SHA-1, is to
use RFC 6920 nih: URIs as identifiers https://tools.ietf.org/html/rfc6920
(or the shorter ni: URIs, which use base64url encoding)

for instance:

{ "@context": "https://w3id.org/ro/crate/1.0/context",
  "@graph": [

    {
      "@type": "CreativeWork",
      "@id": "ro-crate-metadata.jsonld",
      "conformsTo": {"@id": "https://w3id.org/ro/crate/1.0"},
      "about": {"@id": "./"},
      "description": "RO-Crate Metadata File Descriptor (this file)"
    },
    {
      "@id": "./",
      "@type": "Dataset",
      "name": "Example RO-Crate",
      "description": "The RO-Crate Root Data Entity",
      "hasPart": [
        {"@id": "data1.txt"}
      ]
    },
    {
      "@id": "data1.txt",
      "@type": "File",
      "description": "One of hopefully many Data Entities",
      "identifier": { "@id": "nih:sha-256;5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03"}
    }
  ]
}
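
To make the identifier above concrete, here is a minimal Python sketch of
how such URIs could be computed from a file's content. The function names
nih_uri and ni_uri are made up for illustration, only SHA-256 is handled,
and the optional nih: check digit is left out.

```python
import base64
import hashlib


def nih_uri(data: bytes) -> str:
    """Build an nih: URI: algorithm name, then the digest as lowercase hex."""
    return "nih:sha-256;" + hashlib.sha256(data).hexdigest()


def ni_uri(data: bytes) -> str:
    """Build an ni: URI: digest as base64url with the '=' padding removed."""
    digest = hashlib.sha256(data).digest()
    value = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    # Empty authority ("ni:///"), i.e. no particular server is implied.
    return "ni:///sha-256;" + value


if __name__ == "__main__":
    content = b"hello\n"  # a file containing only the line: hello
    print(nih_uri(content))
    print(ni_uri(content))
```

The hex form is longer but can be compared directly with the output of
tools like sha256sum; the base64url form is more compact.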


An advantage of NI URIs is that they can be rewritten to .well-known HTTP
URIs for retrieval. You can then retrieve the content from any supporting
content-delivery platform, since you can check the hash afterwards.
https://tools.ietf.org/html/rfc6920#section-4
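
As a rough illustration of that rewriting, the sketch below maps an ni:
URI to its .well-known form as I read RFC 6920 section 4: the authority
is copied over, and the algorithm and value become path segments. The
function name ni_to_well_known and its default_authority parameter are
hypothetical; example.com is a placeholder host.

```python
from typing import Optional
from urllib.parse import urlsplit


def ni_to_well_known(ni: str, default_authority: Optional[str] = None,
                     scheme: str = "http") -> str:
    """Rewrite ni://host/alg;value to scheme://host/.well-known/ni/alg/value."""
    parts = urlsplit(ni)
    if parts.scheme != "ni":
        raise ValueError("not an ni: URI: " + ni)
    # An ni: URI may have an empty authority; the caller must then pick a host.
    authority = parts.netloc or default_authority
    if not authority:
        raise ValueError("no authority in URI and no default given")
    alg, sep, value = parts.path.lstrip("/").partition(";")
    if not (sep and alg and value):
        raise ValueError("expected a path of the form alg;value")
    url = f"{scheme}://{authority}/.well-known/ni/{alg}/{value}"
    if parts.query:
        url += "?" + parts.query  # any query string is carried over unchanged
    return url


if __name__ == "__main__":
    print(ni_to_well_known(
        "ni://example.com/sha-256;f4OxZX_x_FO5LcGBSKHWXfwtSx-j1ncoSt3SABJtkGk"))
```

Whatever server ends up delivering the bytes, the client should still
recompute the hash and compare it with the value in the URI.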


As for RO-Crate, a more elaborate alternative, where you won't need to
parse the URI, is to use a https://schema.org/PropertyValue similar to
https://w3id.org/ro/crate/1.0/#repository-specific-identifiers
but linking to the id.loc identifiers:

    {
      "@id": "data1.txt",
      "@type": "File",
      "description": "A file containing only the line: hello",
      "identifier": { "@id": "nih:sha-256;5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03"}
    },
    { "@id": "nih:sha-256;5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03",
      "@type": "PropertyValue",
      "name": "sha256",
      "unitText": "hexadecimal",
      "propertyID": { "@id": "http://id.loc.gov/vocabulary/preservation/cryptographicHashFunctions/sha256"},
      "value": "5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03"
    }

Here I used the more lightweight "propertyID" and "hexadecimal" as
unitText, so it would be more of a convention. The RFC6920 URIs are far
more rigorously defined and thus my preference, but at the expense of
parsing.


In this combined approach you get the best of both worlds - you have a
global content-based @id URI for the data file (content), and you have
exposed the sha256 hash value as a separate property so you don't need
to parse that URI.
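
For consumers that only have the nih: URI, the parsing mentioned above is
not much work. A minimal sketch, assuming SHA-256 only: it splits the URI
on ";", drops the optional "-" separators, lower-cases the hex and ignores
any trailing check digit rather than validating it. The names parse_nih
and matches_nih are hypothetical, not part of any RO-Crate tooling.

```python
import hashlib

# RFC 6920 algorithm names mapped to hashlib names (only SHA-256 here).
ALGORITHMS = {"sha-256": "sha256"}


def parse_nih(uri: str) -> tuple:
    """Split an nih: URI into (algorithm, normalized lowercase hex digest)."""
    scheme, _, rest = uri.partition(":")
    if scheme.lower() != "nih":
        raise ValueError("not an nih: URI: " + uri)
    fields = rest.split(";")
    if len(fields) not in (2, 3):  # alg;value with an optional ;checkdigit
        raise ValueError("expected alg;value[;checkdigit]")
    alg = fields[0]
    hex_value = fields[1].replace("-", "").lower()
    return alg, hex_value


def matches_nih(data: bytes, uri: str) -> bool:
    """Return True if the digest of data equals the one named in the URI."""
    alg, expected = parse_nih(uri)
    if alg not in ALGORITHMS:
        raise ValueError("unsupported algorithm: " + alg)
    actual = hashlib.new(ALGORITHMS[alg], data).hexdigest()
    return actual == expected


if __name__ == "__main__":
    uri = ("nih:sha-256;"
           "5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03")
    print(matches_nih(b"hello\n", uri))
```

The same normalization (no separators, lower case) would be a sensible
convention for the "value" of the PropertyValue as well.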


BTW, the RO-Crate community don't usually communicate on this list (but
perhaps we should), could you raise this as a Use Case on 
https://github.com/researchobject/ro-crate/issues ? 

Feel free to link to my reply!


We can then discuss it on the next RO-Crate telcon, which is 
scheduled for 2020-02-27
<https://s.apache.org/ro-crate-minutes>

It might also be worth asking on the schema.org list as there might be
others there dealing with hash values.

-- 
Stian Soiland-Reyes
The University of Manchester 🐝
https://www.esciencelab.org.uk/
https://orcid.org/0000-0001-9842-9718

Received on Wednesday, 12 February 2020 14:39:47 UTC