Re: generating globally unique triple identifier as IRIs from Blake Regalia on 2018-07-25 (semantic-web@w3.org from July 2018)

From: Blake Regalia <blake.regalia@gmail.com>
Date: Tue, 24 Jul 2018 17:36:47 -0700
To: hugh@glasers.org
Cc: tl@rat.io, semantic-web@w3.org, hans.teijgeler@quicknet.nl
Message-ID: <CANMU0MGaFo3ngOr4DC-JmSGmk7naLs85rYL-OGnxzEdmwSmr+g@mail.gmail.com>
Thomas,

  It's worth checking out the RDF Dataset Normalization Algorithm (aka the
Universal RDF Dataset Normalization Algorithm 2015 or URDNA2015):
https://json-ld.github.io/normalization/spec/

Essentially, you would normalize the triple as if it were the only triple
in a graph, then compute a SHA-256 hash of the resulting normalized
N-Quads string and use that hash as the suffix of an IRI you mint.
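
As a rough illustration of that idea, here is a minimal Python sketch. It
is not an implementation of URDNA2015: it assumes a ground triple (no blank
nodes) whose object is either an IRI or a plain string literal, in which
case normalization should reduce to writing the triple as one N-Quads line.
Language tags and datatypes are not handled. BASE_IRI and the function names
are hypothetical, chosen only for this example.

```python
import hashlib

# Hypothetical namespace for the minted identifiers (an assumption).
BASE_IRI = "http://www.example.com/triple/"


def escape_literal(value: str) -> str:
    """Escape backslash, double quote, LF and CR in a literal's lexical form."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def nquads_line(subject: str, predicate: str, obj: str, obj_is_iri: bool = False) -> str:
    """Serialize one ground triple (no blank nodes) as a single N-Quads line."""
    obj_term = f"<{obj}>" if obj_is_iri else f'"{escape_literal(obj)}"'
    return f"<{subject}> <{predicate}> {obj_term} .\n"


def triple_iri(subject: str, predicate: str, obj: str, obj_is_iri: bool = False) -> str:
    """Mint an IRI whose suffix is the SHA-256 hex digest of the N-Quads line."""
    line = nquads_line(subject, predicate, obj, obj_is_iri)
    digest = hashlib.sha256(line.encode("utf-8")).hexdigest()
    return BASE_IRI + digest


if __name__ == "__main__":
    print(triple_iri(
        "http://www.w3.org/2001/08/rdf-test/",
        "http://purl.org/dc/elements/1.1/creator",
        "Dave Beckett",
    ))
```

Because the input is the full serialized triple and nothing else, the same
triple yields the same IRI on any machine. A triple containing blank nodes
would need the real normalization algorithm before hashing.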

 - Blake


On Tue, Jul 24, 2018 at 12:44 PM Hugh Glaser <hugh@glasers.org> wrote:

> Hi,
>
> I think UUIDs don't satisfy
> >> - reproducible: the same triple must always get the same identifier
> but come close.
>
> For the non-domain bit:
> Can't you just use a (big) hash of the (normalised?) text of the triple?
> (So, perhaps expand any namespace declarations, if you want to be
> independent of the namespace identifiers chosen, ensure the separators are
> all something appropriate, and think about case-sensitivity.)
>
> Then do something like (php)
> function hashIt($name) {
>     $hash = sha1(mb_strtolower(trim($name)));
>     return substr($hash, 0, 8) .'-'. substr($hash, 8, 8) .'-'.
>         substr($hash, 16, 8) .'-'. substr($hash, 24, 8) .'-'.
>         substr($hash, 32, 8);
> }
> (I use this for a similar purpose to make text fuse, hence the
> mb_strtolower, but you probably don't want that.)
>
> which gives the ability to do
> http://www.example.com/triple/ee2becf1-1fc8a9c5-55f98825-38c859a6-81e24602
> for
> <http://www.w3.org/2001/08/rdf-test/> <http://purl.org/dc/elements/1.1/creator> "Dave Beckett" .
>
> It looks like a UUID, but definitely isn't, because it doesn't have the
> stuff at the start of the hash or conform to any of the versions of the ISO
> standard.
> You could add the start stuff if you want (at the cost of a little
> entropy, of course.)
>
> It also doesn't have any of the time-stamp or MAC address salting of some
> UUID versions.
> You aren't quite clear on that bit:
> > - reproducible: the same triple must always get the same identifier
> Do you mean if generated by a machine anywhere?
> If so, then the straight hash is good - if not you want to add salt of a
> local machine parameter, such as MAC address.
>
> For the domain(s):
> I'm not sure what your requirements are.
> Do you just want the identifiers to fuse for the same triples?
> But not be able to find out anything about them?
> That is, a one-way encoding?
> That is, opaque IDs?
> If so, you can choose anything you like, as long as it doesn't resolve to
> anything sensible.
> A more interesting thing is to use the domain of the machine doing the ID
> generation, and then put the triple at the endpoint for the ID - so
> resolution of the ID will return the triple (in text, or whatever you
> fancy, maybe RDF?)
> You can always make the resolution require authentication if you want the
> triple to stay opaque for normal use.
>
> Best
> Hugh
>
> > On 24 Jul 2018, at 17:17, Hans Teijgeler <hans.teijgeler@quicknet.nl>
> wrote:
> >
> > Hi Thomas,
> >
> > We use UUIDs throughout, such as
> http://www.example.com/c00c7d47-a04a-46c5-9d81-623def3ff31c
> >
> > Look, for example, at https://www.uuidgenerator.net/
> > Regards, Hans
> > 15826,org
> > On 24-7-2018 17:56, thomas lörtsch wrote:
> >> I’m looking for an algorithm to generate identifiers for any possible
> RDF statements that are themselves IRIs, globally unique, reproducible and
> not dependent on the source of the statement (i.e. document URI).
> >>
> >> In other words, the algorithm has to have the following properties:
> >> - reproducible: the same triple must always get the same identifier
> >> - global: the identifier must be independent from the source of the
> triple
> >>   (a document title, named graph name, graph URI etc)
> >> - unique: triples that differ in subject, predicate or object must have
> different identifiers
> >> - conformant: the identifier must be an IRI
> >> - autonomous and finite: it can’t rely on any central repository,
> secondary service etc
> >>
> >> Does such an algorithm exist? Is it even possible given that each node
> in the statement could be an IRI of maximal length and entropy?
> >>
> >>
> >> Thanks,
> >> Thomas
> >>
> >
> >
>
> --
> Hugh
> 023 8061 5652
>
>
>
Received on Wednesday, 25 July 2018 00:37:27 UTC