Re: generating globally unique triple identifier as IRIs from Eric Prud'hommeaux on 2018-07-25 (semantic-web@w3.org from July 2018)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Wed, 25 Jul 2018 04:18:53 -0400
To: Dimitris Kontokostas <jimkont@gmail.com>
Cc: blake.regalia@gmail.com, Hugh Glaser <hugh@glasers.org>, tl@rat.io, semantic-web@w3.org, hans.teijgeler@quicknet.nl
Message-ID: <20180725081852.GZ8681@w3.org>
* Dimitris Kontokostas <jimkont@gmail.com> [2018-07-25 10:10+0300]
> Hi,
> a simple approach would be to generate a unique string from the triple in
> the form of e.g.
> 
> "$subjectIri - $predicateIri - $objectStringValueOrIri - $datatypeOREmpty -
> $langOrEmpty"
> (it could also be the hash of the NTriples representation of the triple)
> 
> then perform a strong hash algorithm on the string, e.g. sha256 and append
> that to a namespace of your choice
> if you find the IRI too big you can pass the hash through a url-safe
> base64 function
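
A minimal sketch of the recipe quoted above, using the N-Triples variant
Dimitris mentions because its escaping keeps the fields unambiguous. The
namespace and the function names are hypothetical. It assumes the IRIs are
already absolute and valid, that language tags may be lowercased, and that
an xsd:string datatype may be dropped as in RDF 1.1. It is not a full
N-Triples serializer and it does not handle blank nodes.

```python
import base64
import hashlib

NAMESPACE = "http://www.example.com/triple/"  # hypothetical; pick your own
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


def escape_literal(value):
    """Escape the characters that must not appear raw in a quoted literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def ntriples_line(subject, predicate, obj, is_literal=False, datatype="", lang=""):
    """Serialize one ground triple (no blank nodes) as a single N-Triples line."""
    if not is_literal:
        term = "<" + obj + ">"
    else:
        term = '"' + escape_literal(obj) + '"'
        if lang:
            # Assumption: language tags are compared case-insensitively.
            term += "@" + lang.lower()
        elif datatype and datatype != XSD_STRING:
            # Assumption: a plain literal and an xsd:string literal are the same.
            term += "^^<" + datatype + ">"
    return "<%s> <%s> %s ." % (subject, predicate, term)


def triple_iri(subject, predicate, obj, **kwargs):
    """Hash the serialized triple with SHA-256 and append it to the namespace."""
    line = ntriples_line(subject, predicate, obj, **kwargs)
    digest = hashlib.sha256(line.encode("utf-8")).digest()
    # URL-safe base64 without padding: 43 characters instead of 64 hex digits.
    token = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return NAMESPACE + token


iri = triple_iri("http://www.w3.org/2001/08/rdf-test/",
                 "http://purl.org/dc/elements/1.1/creator",
                 "Dave Beckett", is_literal=True)
print(iri)
```

The same triple yields the same IRI on any machine, since nothing but the
triple itself goes into the hash.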

This is mostly from memory and hearsay. Apologies for any misinformation.

WikiData does something like this for Statement identifiers, but
they're coming from a property graph (i.e. statements can have
attribute/value pairs) to RDF. This leads to a couple of modeling
challenges, including the fact that they don't normalize before
summing, so a statement like:

wds:Q42591-22E951C9-6AAD-4BC8-9A38-6675186F94D0 a wikibase:Statement,
  wikibase:BestRank ;
 wikibase:rank wikibase:NormalRank ;
 ps:P910 wd:Q6895476 ;
 prov:wasDerivedFrom wdref:2e7fa0083ba1c302cdcf3fa65c613f90daeb4c37 .

would have a different wds:Q identifier if the attributes were in a
different order. This isn't exactly analogous to your issue but does
show you why normalization has a couple of advantages:

1. semantically equivalent assertions have the same ID.
2. arbitrary reorderings of assertions don't break URLs.
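
To illustrate the two points above, here is a small sketch of hashing a
statement's attribute/value pairs with and without normalizing first. It is
not Wikidata's code; the function name, the choice of SHA-1 and the
"sort the lines" normalization are assumptions made for the example.

```python
import hashlib


def statement_id(pairs, normalize=True):
    """Hash a statement given as (property, value) pairs."""
    lines = ["%s %s" % (prop, value) for prop, value in pairs]
    if normalize:
        # Drop duplicates and sort, so the input order no longer matters.
        lines = sorted(set(lines))
    payload = "\n".join(lines).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


pairs = [
    ("rdf:type", "wikibase:Statement"),
    ("rdf:type", "wikibase:BestRank"),
    ("wikibase:rank", "wikibase:NormalRank"),
    ("ps:P910", "wd:Q6895476"),
    ("prov:wasDerivedFrom", "wdref:2e7fa0083ba1c302cdcf3fa65c613f90daeb4c37"),
]
reordered = list(reversed(pairs))

# With normalization the ID survives reordering; without it, it does not.
print(statement_id(pairs) == statement_id(reordered))
print(statement_id(pairs, normalize=False) == statement_id(reordered, normalize=False))
```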


> Best,
> Dimitris
> 
> p.s. of course, blank nodes cannot go this route
> 
> 
> On Wed, Jul 25, 2018 at 3:43 AM Blake Regalia <blake.regalia@gmail.com>
> wrote:
> 
> > Thomas,
> >
> >   It's worth checking out the RDF Dataset Normalization Algorithm (aka the
> > Universal RDF Dataset Normalization Algorithm 2015 or URDNA2015):
> > https://json-ld.github.io/normalization/spec/
> >
> > Essentially, you would normalize the triple as if it were the only triple
> > in a graph, then perform a hash (SHA-256) on the resulting (normalized)
> > n-quads string and use that as the suffix to some IRI you mint.
> >
> >  - Blake
> >
> >
> > On Tue, Jul 24, 2018 at 12:44 PM Hugh Glaser <hugh@glasers.org> wrote:
> >
> >> Hi,
> >>
> >> I think UUIDs don't satisfy
> >> >> - reproducible: the same triple must always get the same identifier
> >> but come close.
> >>
> >> For the non-domain bit:
> >> Can't you just use a (big)hash of the (normalised?) text of the triple?
> >> (So, perhaps expand any namespace declarations, if you want to be
> >> independent of the namespace identifiers chosen, ensure the separators are
> >> all something appropriate, and think about case-sensitivity.)
> >>
> >> Then do something like (php)
> >> function hashIt($name) {
> >>     $hash = sha1(mb_strtolower(trim($name)));
> >>     return substr($hash, 0, 8) .'-'. substr($hash, 8, 8) .'-'.
> >>         substr($hash, 16, 8) .'-'. substr($hash, 24, 8) .'-'. substr($hash, 32, 8);
> >> }
> >> (I use this for a similar purpose to make text fuse, hence the
> >> mb_strtolower, but you probably don't want that.)
> >>
> >> which gives the ability to do
> >> http://www.example.com/triple/ee2becf1-1fc8a9c5-55f98825-38c859a6-81e24602
> >> for
> >> <http://www.w3.org/2001/08/rdf-test/> <http://purl.org/dc/elements/1.1/creator> "Dave Beckett" .
> >>
> >> It looks like a UUID, but definitely isn't, because it doesn't have the
> >> stuff at the start of the hash or conform to any of the versions of the ISO
> >> standard.
> >> You could add the start stuff if you want (at the cost of a little
> >> entropy, of course.)
> >>
> >> It also doesn't have any of the time-stamp or MAC address salting of some
> >> UUID versions.
> >> You aren't quite clear on that bit:
> >> > - reproducible: the same triple must always get the same identifier
> >> Do you mean if generated by a machine anywhere?
> >> If so, then the straight hash is good - if not you want to add salt of a
> >> local machine parameter, such as MAC address.
> >>
> >> For the domain(s):
> >> I'm not sure what your requirements are.
> >> Do you just want the identifiers to fuse for the same triples?
> >> But not be able to find out anything about them?
> >> That is, a one-way encoding?
> >> That is, opaque IDs?
> >> If so, you can choose anything you like, as long as it doesn't resolve to
> >> anything sensible.
> >> A more interesting thing is to use the domain of the machine doing the ID
> >> generation, and then put the triple at the endpoint for the ID - so
> >> resolution of the ID will return the triple (in text, or whatever you
> >> fancy, maybe RDF?)
> >> You can always make the resolution require authentication if you want the
> >> triple to stay opaque for normal use.
> >>
> >> Best
> >> Hugh
> >>
> >> > On 24 Jul 2018, at 17:17, Hans Teijgeler <hans.teijgeler@quicknet.nl>
> >> wrote:
> >> >
> >> > Hi Thomas,
> >> >
> >> > We use UUIDs throughout, such as
> >> http://www.example.com/c00c7d47-a04a-46c5-9d81-623def3ff31c
> >> >
> >> > Look, for example, at https://www.uuidgenerator.net/
> >> > Regards, Hans
> >> > 15826,org
> >> > On 24-7-2018 17:56, thomas lörtsch wrote:
> >> >> I’m looking for an algorithm to generate identifiers for any possible
> >> RDF statements that are themselves IRIs, globally unique, reproducible and
> >> not dependent on the source of the statement (i.e. document URI).
> >> >>
> >> >> In other words, the algorithm has to have the following properties:
> >> >> - reproducible: the same triple must always get the same identifier
> >> >> - global: the identifier must be independent from the source of the
> >> triple
> >> >>   (a document title, named graph name, graph URI etc)
> >> >> - unique: triples that differ in subject, predicate or object must
> >> have different identifiers
> >> >> - conformant: the identifier must be an IRI
> >> >> - autonomous and finite: it can’t rely on any central repository,
> >> secondary service etc
> >> >>
> >> >> Does such an algorithm exist? Is it even possible given that each node
> >> in the statement could be an IRI of maximal length and entropy?
> >> >>
> >> >>
> >> >> Thanks,
> >> >> Thomas
> >> >>
> >> >
> >> >
> >>
> >> --
> >> Hugh
> >> 023 8061 5652
> >>
> >>
> >>
> 
> -- 
> Kontokostas Dimitris

-- 
-ericP

office: +1.617.599.3509
mobile: +33.6.80.80.35.59

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.

There are subtle nuances encoded in font variation and clever layout
which can only be seen by printing this message on high-clay paper.
Received on Wednesday, 25 July 2018 08:19:02 UTC