Re: generating globally unique triple identifier as IRIs from Dimitris Kontokostas on 2018-07-25 (semantic-web@w3.org from July 2018)

From: Dimitris Kontokostas <jimkont@gmail.com>
Date: Wed, 25 Jul 2018 10:10:09 +0300
To: blake.regalia@gmail.com
Cc: Hugh Glaser <hugh@glasers.org>, tl@rat.io, semantic-web@w3.org, hans.teijgeler@quicknet.nl
Message-ID: <CA+u4+a2kmvDb1M4H_==8si2bnv8SnGzjNYL69AjvUxR4qS96FQ@mail.gmail.com>
Hi,
a simple approach would be to generate a unique string from the triple in
the form of e.g.

"$subjectIri - $predicateIri - $objectStringValueOrIri - $datatypeOREmpty -
$langOrEmpty"
(alternatively, you could hash the N-Triples representation of the triple
instead)

Then run a strong hash algorithm, e.g. SHA-256, over that string and append
the result to a namespace of your choice.
If you find the IRI too long, you can pass the hash through a URL-safe
Base64 function.
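
A minimal sketch of these steps in Python, using only the standard library.
The function name triple_iri, the "short" flag and the namespace
http://www.example.com/triple/ are made up for illustration; the " - "
separator follows the string form given above.

```python
import base64
import hashlib


def triple_iri(subject, predicate, obj, datatype="", lang="",
               namespace="http://www.example.com/triple/", short=False):
    """Mint an IRI for a triple by hashing a string built from its parts.

    All names and defaults here are illustrative assumptions.
    """
    # Build the string "$subject - $predicate - $object - $datatype - $lang".
    key = " - ".join([subject, predicate, obj, datatype, lang])
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    if short:
        # URL-safe Base64 without "=" padding: 43 characters instead of
        # the 64 characters of the hexadecimal form.
        suffix = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    else:
        suffix = digest.hex()
    return namespace + suffix


print(triple_iri("http://www.w3.org/2001/08/rdf-test/",
                 "http://purl.org/dc/elements/1.1/creator",
                 "Dave Beckett", short=True))
```

One caveat with a plain separator: a literal that itself contains " - "
could make two different triples produce the same string. Hashing the
N-Triples form avoids that, because its terms are delimited and escaped.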

Best,
Dimitris

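P.S. Of course, triples containing blank nodes cannot go this route, since
blank node labels are only meaningful within the document that uses them.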


On Wed, Jul 25, 2018 at 3:43 AM Blake Regalia <blake.regalia@gmail.com>
wrote:

> Thomas,
>
>   It's worth checking out the RDF Dataset Normalization Algorithm (aka the
> Universal RDF Dataset Normalization Algorithm 2015 or URDNA2015):
> https://json-ld.github.io/normalization/spec/
>
> Essentially, you would normalize the triple as if it were the only triple
> in a graph, then perform a hash (SHA-256) on the resulting (normalized)
> n-quads string and use that as the suffix to some IRI you mint.
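
A rough sketch of that idea in Python, under a strong simplifying
assumption: it does not implement URDNA2015. For a single triple without
blank nodes there is nothing to relabel, so the sketch just writes one
N-Quads line by hand and hashes it. The function names are made up, the
escaping covers only backslash, double quote, newline and carriage return,
and details such as datatype handling should be checked against the spec
before relying on the output matching a real normalizer.

```python
import hashlib


def escape_literal(value):
    # Escape the characters that would break a quoted N-Quads literal.
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def nquads_line(subject, predicate, obj, is_literal=False,
                datatype=None, lang=None):
    """Serialize one blank-node-free triple as a single N-Quads line."""
    if is_literal:
        term = '"' + escape_literal(obj) + '"'
        if lang:
            term += "@" + lang
        elif datatype:
            term += "^^<" + datatype + ">"
    else:
        term = "<" + obj + ">"
    return "<%s> <%s> %s .\n" % (subject, predicate, term)


def triple_hash(*args, **kwargs):
    # SHA-256 over the UTF-8 bytes of the serialized line, as hex.
    line = nquads_line(*args, **kwargs)
    return hashlib.sha256(line.encode("utf-8")).hexdigest()


print(triple_hash("http://www.w3.org/2001/08/rdf-test/",
                  "http://purl.org/dc/elements/1.1/creator",
                  "Dave Beckett", is_literal=True))
```

The hex digest can then be appended to whatever IRI prefix you mint.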
>
>  - Blake
>
>
> On Tue, Jul 24, 2018 at 12:44 PM Hugh Glaser <hugh@glasers.org> wrote:
>
>> Hi,
>>
>> I think UUIDs don't satisfy
>> >> - reproducible: the same triple must always get the same identifier
>> but come close.
>>
>> For the non-domain bit:
>> Can't you just use a (big) hash of the (normalised?) text of the triple?
>> (So, perhaps expand any namespace declarations, if you want to be
>> independent of the namespace identifiers chosen, ensure the separators are
>> all something appropriate, and think about case-sensitivity.)
>>
>> Then do something like (php)
>> function hashIt($name) {
>>     // sha1() returns 40 hex characters.
>>     $hash = sha1(mb_strtolower(trim($name)));
>>     // Split them into five groups of eight, joined by hyphens.
>>     return implode('-', str_split($hash, 8));
>> }
>> (I use this for a similar purpose to make text fuse, hence the
>> mb_strtolower, but you probably don't want that.)
>>
>> which gives the ability to do
>> http://www.example.com/triple/ee2becf1-1fc8a9c5-55f98825-38c859a6-81e24602
>> for
>> <http://www.w3.org/2001/08/rdf-test/> <http://purl.org/dc/elements/1.1/creator> "Dave Beckett" .
>>
>> It looks like a UUID, but definitely isn't, because it doesn't have the
>> stuff at the start of the hash or conform to any of the versions of the ISO
>> standard.
>> You could add the start stuff if you want (at the cost of a little
>> entropy, of course.)
>>
>> It also doesn't have any of the time-stamp or MAC address salting of some
>> UUID versions.
>> You aren't quite clear on that bit:
>> > - reproducible: the same triple must always get the same identifier
>> Do you mean if generated by a machine anywhere?
>> If so, then the straight hash is good - if not you want to add salt of a
>> local machine parameter, such as MAC address.
>>
>> For the domain(s):
>> I'm not sure what your requirements are.
>> Do you just want the identifiers to fuse for the same triples?
>> But not be able to find out anything about them?
>> That is, a one-way encoding?
>> That is, opaque IDs?
>> If so, you can choose anything you like, as long as it doesn't resolve to
>> anything sensible.
>> A more interesting thing is to use the domain of the machine doing the ID
>> generation, and then put the triple at the endpoint for the ID - so
>> resolution of the ID will return the triple (in text, or whatever you
>> fancy, maybe RDF?)
>> You can always make the resolution require authentication if you want the
>> triple to stay opaque for normal use.
>>
>> Best
>> Hugh
>>
>> > On 24 Jul 2018, at 17:17, Hans Teijgeler <hans.teijgeler@quicknet.nl>
>> wrote:
>> >
>> > Hi Thomas,
>> >
>> > We use UUIDs throughout, such as
>> http://www.example.com/c00c7d47-a04a-46c5-9d81-623def3ff31c
>> >
>> > Look, for example, at https://www.uuidgenerator.net/
>> > Regards, Hans
>> > 15826,org
>> > On 24-7-2018 17:56, thomas lörtsch wrote:
>> >> I’m looking for an algorithm to generate identifiers for any possible
>> RDF statements that are themselves IRIs, globally unique, reproducible and
>> not dependent on the source of the statement (i.e. document URI).
>> >>
>> >> In other words, the algorithm has to have the following properties:
>> >> - reproducible: the same triple must always get the same identifier
>> >> - global: the identifier must be independent from the source of the
>> triple
>> >>   (a document title, named graph name, graph URI etc)
>> >> - unique: triples that differ in subject, predicate or object must
>> have different identifiers
>> >> - conformant: the identifier must be an IRI
>> >> - autonomous and finite: it can’t rely on any central repository,
>> secondary service etc
>> >>
>> >> Does such an algorithm exist? Is it even possible given that each node
>> in the statement could be an IRI of maximal length and entropy?
>> >>
>> >>
>> >> Thanks,
>> >> Thomas
>> >>
>> >
>> >
>>
>> --
>> Hugh
>> 023 8061 5652
>>
>>
>>

-- 
Kontokostas Dimitris
Received on Wednesday, 25 July 2018 07:11:00 UTC