Re: generating globally unique triple identifier as IRIs from Eric Prud'hommeaux on 2018-07-25 (semantic-web@w3.org from July 2018)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Wed, 25 Jul 2018 05:11:47 -0400
To: thomas lörtsch <tl@rat.io>
Cc: Dimitris Kontokostas <jimkont@gmail.com>, blake.regalia@gmail.com, Hugh Glaser <hugh@glasers.org>, semantic-web@w3.org, hans.teijgeler@quicknet.nl
Message-ID: <20180725091146.GA8681@w3.org>
* thomas lörtsch <tl@rat.io> [2018-07-25 10:36+0200]
> Thanks a lot everybody for all the input and helpful comments!
> 
> I found it astonishingly difficult to define exactly what I actually want and the first description I gave was already long but still incomplete. I reworded and extended it slightly. A triple identifier must have the following properties:
> 
>   - compliant: the identifier must be an IRI
>   - unique: the identifier must be distinct from all other IRIs
>   - reproducible: the same triple must always get the same identifier
>   - discriminate: triples that differ in subject, predicate or object must have 
>     different identifiers
>   - reciprocal: an identifier without a token doesn't identify anything
>   - global: the identifier must be independent from the source of the triple
>   - purportless: the identifier can't be expected to carry any meaning whatsoever
>   - independent: it can’t rely on any central repository, secondary service etc
>   - (optional) reversible: the token can be reestablished from the ID 

`reversible` seems to be at odds with `purportless`, but that depends on the audience. These are all features that should be apparent to the outside world; if you create identifiers which have special meaning to your infrastructure (e.g. the path can be parsed into triples), that's fine, and frequently the only way to do things. For example, many URLs are composed from a unique key in a database which is used when a query interface serves that URL. REST architecture says you can stuff as much semantics as you want in there; just don't encourage anyone outside your domain to parse those URLs. They should instead follow explicit arcs (typically conveyed in HTTP links in REST) to get to e.g. the original triples, provenance, confidence metrics, etc.
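
To make that concrete, here is a minimal sketch of an identifier that is opaque to outsiders but parseable by the service that minted it. It assumes the triple is already serialized as a single N-Triples line; the namespace and the function names `mint` and `resolve` are made up for illustration, and zlib plus URL-safe base64 is just one possible encoding.

```python
import base64
import zlib

NAMESPACE = "http://example.org/triple/"  # hypothetical namespace

def mint(ntriple: str) -> str:
    """Compress an N-Triples line and embed it in an IRI (reversible)."""
    packed = zlib.compress(ntriple.encode("utf-8"), 9)
    token = base64.urlsafe_b64encode(packed).decode("ascii").rstrip("=")
    return NAMESPACE + token

def resolve(iri: str) -> str:
    """Recover the N-Triples line; meant for the minting service only."""
    token = iri[len(NAMESPACE):]
    token += "=" * (-len(token) % 4)  # restore the stripped base64 padding
    return zlib.decompress(base64.urlsafe_b64decode(token)).decode("utf-8")
```

One caveat: compressed output can differ between zlib implementations, so this sketch is reversible but not guaranteed to be `reproducible` across machines. Dropping the compression step (plain URL-safe base64) would avoid that at the cost of longer IRIs.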


> However I also discovered in the meantime that there is indeed no fixed length limit on IRIs specified which gives some wiggle room — a lot actually, invalidating my most basic concern. 
> 
> So some namespace combined with some compression algorithm (and if applicable some blank node skolemization) (and if necessary some cryptography applied to that) should do the trick. I’m leaving it at that for the moment as for a POC that seems quite sufficient.
> 
> Thanks again,
> Thomas
> 
> 
> > On 25. Jul 2018, at 09:10, Dimitris Kontokostas <jimkont@gmail.com> wrote:
> > 
> > Hi, 
> > a simple approach would be to generate a unique string from the triple in the form of e.g. 
> > 
> > "$subjectIri - $predicateIri - $objectStringValueOrIri - $datatypeOREmpty - $langOrEmpty"
> > (it could also be the hash of the NTriples representation of the triple)
> > 
> > then perform a strong hash algorithm on the string, e.g. sha256 and append that to a namespace of your choice
> > if you find the IRI too big you can pass the hash through a url-safe base64 function
> > 
> > Best,
> > Dimitris
> > 
> > p.s. of course, blank nodes cannot go this route 
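
A minimal sketch of this hash-and-append approach, assuming the triple has already been written out as one consistently formatted N-Triples line (the sketch does no normalization itself, and does nothing for blank nodes). The namespace and the name `triple_iri` are hypothetical.

```python
import base64
import hashlib

NAMESPACE = "http://example.org/triple/"  # hypothetical namespace

def triple_iri(ntriple: str) -> str:
    """Derive an IRI from the SHA-256 digest of an N-Triples line."""
    digest = hashlib.sha256(ntriple.encode("utf-8")).digest()
    # URL-safe base64 of the 32-byte digest is 43 characters once the
    # trailing '=' padding is dropped, versus 64 characters in hex.
    token = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    return NAMESPACE + token
```

The same input string always yields the same IRI on any machine, but two serializations of the same triple that differ in whitespace or escaping yield different IRIs, which is why the normalization step matters.
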
> > 
> > 
> > On Wed, Jul 25, 2018 at 3:43 AM Blake Regalia <blake.regalia@gmail.com> wrote:
> > Thomas,
> > 
> >   It's worth checking out the RDF Dataset Normalization Algorithm (aka the Universal RDF Dataset Normalization Algorithm 2015 or URDNA2015): https://json-ld.github.io/normalization/spec/
> > 
> > Essentially, you would normalize the triple as if it were the only triple in a graph, then perform a hash (SHA-256) on the resulting (normalized) n-quads string and use that as the suffix to some IRI you mint.
> > 
> >  - Blake
> > 
> > 
> > On Tue, Jul 24, 2018 at 12:44 PM Hugh Glaser <hugh@glasers.org> wrote:
> > Hi,
> > 
> > I think UUIDs don't satisfy 
> > >> - reproducible: the same triple must always get the same identifier
> > but come close.
> > 
> > For the non-domain bit:
> > Can't you just use a (big)hash of the (normalised?) text of the triple?
> > (So, perhaps expand any namespace declarations, if you want to be independent of the namespace identifiers chosen, ensure the separators are all something appropriate, and think about case-sensitivity.)
> > 
> > Then do something like (php)
> > function hashIt($name) {
> >     $hash = sha1(mb_strtolower(trim($name)));
> >     return substr($hash, 0, 8) .'-'. substr($hash, 8, 8) .'-'. substr($hash, 16, 8) .'-'. substr($hash, 24, 8) .'-'. substr($hash, 32,8);
> > }
> > (I use this for a similar purpose to make text fuse, hence the mb_strtolower, but you probably don't want that.)
> > 
> > which gives the ability to do
> > http://www.example.com/triple/ee2becf1-1fc8a9c5-55f98825-38c859a6-81e24602
> > for
> > <http://www.w3.org/2001/08/rdf-test/>  <http://purl.org/dc/elements/1.1/creator>    "Dave Beckett" .
> > 
> > It looks like a UUID, but definitely isn't, because it doesn't have the stuff at the start of the hash or conform to any of the versions of the ISO standard.
> > You could add the start stuff if you want (at the cost of a little entropy, of course.)
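
For what it's worth, name-based UUIDs (version 5 in RFC 4122) are built this way: a SHA-1 hash over a namespace UUID plus a name, with the version and variant bits set. They are reproducible, unlike the random or time-based versions. A minimal sketch, assuming the N-Triples line is used as the name; the function name `triple_uuid_iri` is made up, and the choice of `uuid.NAMESPACE_URL` as the namespace UUID is arbitrary.

```python
import uuid

def triple_uuid_iri(ntriple: str,
                    namespace: str = "http://www.example.com/triple/") -> str:
    """Mint an IRI from a version 5 (SHA-1, name-based) UUID of the triple."""
    # NAMESPACE_URL is an arbitrary choice here; any fixed namespace UUID
    # works as long as every party uses the same one.
    return namespace + str(uuid.uuid5(uuid.NAMESPACE_URL, ntriple))
```

The result is a well-formed UUID, at the cost of the few bits of entropy that go to the version and variant fields.
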
> > 
> > It also doesn't have any of the time-stamp or MAC address salting of some UUID versions.
> > You aren't quite clear on that bit:
> > > - reproducible: the same triple must always get the same identifier
> > Do you mean if generated by a machine anywhere?
> > If so, then the straight hash is good - if not you want to add salt of a local machine parameter, such as MAC address.
> > 
> > For the domain(s):
> > I'm not sure what your requirements are.
> > Do you just want the identifiers to fuse for the same triples?
> > But not be able to find out anything about them?
> > That is, a one-way encoding?
> > That is, opaque IDs?
> > If so, you can choose anything you like, as long as it doesn't resolve to anything sensible.
> > A more interesting thing is to use the domain of the machine doing the ID generation, and then put the triple at the endpoint for the ID - so resolution of the ID will return the triple (in text, or whatever you fancy, maybe RDF?)
> > You can always make the resolution require authentication if you want the triple to stay opaque for normal use.
> > 
> > Best
> > Hugh
> > 
> > > On 24 Jul 2018, at 17:17, Hans Teijgeler <hans.teijgeler@quicknet.nl> wrote:
> > > 
> > > Hi Thomas,
> > > 
> > > We use UUIDs throughout, such as http://www.example.com/c00c7d47-a04a-46c5-9d81-623def3ff31c
> > > 
> > > Look, for example, at https://www.uuidgenerator.net/
> > > Regards, Hans
> > > 15826,org
> > > On 24-7-2018 17:56, thomas lörtsch wrote:
> > >> I’m looking for an algorithm to generate identifiers for any possible RDF statements that are themselves IRIs, globally unique, reproducible and not dependent on the source of the statement (i.e. document URI).
> > >> 
> > >> In other words, the algorithm has to have the following properties:
> > >> - reproducible: the same triple must always get the same identifier
> > >> - global: the identifier must be independent from the source of the triple 
> > >>   (a document title, named graph name, graph URI etc)
> > >> - unique: triples that differ in subject, predicate or object must have different identifiers
> > >> - conformant: the identifier must be an IRI
> > >> - autonomous and finite: it can’t rely on any central repository, secondary service etc
> > >> 
> > >> Does such an algorithm exist? Is it even possible given that each node in the statement could be an IRI of maximal length and entropy? 
> > >> 
> > >> 
> > >> Thanks,
> > >> Thomas
> > >> 
> > > 
> > > 
> > 
> > -- 
> > Hugh
> > 023 8061 5652
> > 
> > 
> > 
> > 
> > -- 
> > Kontokostas Dimitris
> 
> 

-- 
-ericP

office: +1.617.599.3509
mobile: +33.6.80.80.35.59

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.

There are subtle nuances encoded in font variation and clever layout
which can only be seen by printing this message on high-clay paper.
Received on Wednesday, 25 July 2018 09:11:58 UTC