Re: generating globally unique triple identifier as IRIs

From: Hugh Glaser <hugh@glasers.org>
Date: Tue, 24 Jul 2018 18:43:05 +0100
Cc: semantic-web <semantic-web@w3.org>, Hans Teijgeler <hans.teijgeler@quicknet.nl>
Message-Id: <231A787E-6173-415B-B0ED-610BDD08B82A@glasers.org>
To: tl@rat.io

I think UUIDs don't satisfy 
>> - reproducible: the same triple must always get the same identifier
but come close.

For the non-domain bit:
Can't you just use a (big)hash of the (normalised?) text of the triple?
(So, perhaps expand any namespace declarations, if you want to be independent of the namespace identifiers chosen, ensure the separators are all something appropriate, and think about case-sensitivity.)

Then do something like (php)
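function hashIt($name) {
    // SHA-1 of the trimmed, lower-cased text: 40 hex characters
    $hash = sha1(mb_strtolower(trim($name)));
    // Split into five 8-character groups joined by hyphens
    return substr($hash, 0, 8) .'-'. substr($hash, 8, 8) .'-'. substr($hash, 16, 8) .'-'. substr($hash, 24, 8) .'-'. substr($hash, 32, 8);
}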
(I use this for a similar purpose to make text fuse, hence the mb_strtolower, but you probably don't want that.)

which gives the ability to do
<http://www.w3.org/2001/08/rdf-test/>  <http://purl.org/dc/elements/1.1/creator>    "Dave Beckett" .
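
A rough Python sketch of the normalise-then-hash idea, in case it helps. The PREFIXES table and the function names expand and triple_id are made up for illustration. The normalisation is only a guess at what "appropriate" means: expand prefixed names, use a single space as separator, leave case alone. It does not handle blank nodes or literal escaping.

```python
import hashlib

# Hypothetical prefix table; a real one would come from the source document.
PREFIXES = {"dc": "http://purl.org/dc/elements/1.1/"}

def expand(term: str) -> str:
    """Return an N-Triples-style term; literals and <IRI>s pass through."""
    term = term.strip()
    if term.startswith('"') or term.startswith("<"):
        return term
    prefix, _, local = term.partition(":")
    if prefix in PREFIXES:
        # Prefixed name such as dc:creator: expand to the full IRI.
        return "<" + PREFIXES[prefix] + local + ">"
    # Otherwise assume a bare IRI and just add the brackets.
    return "<" + term + ">"

def triple_id(s: str, p: str, o: str) -> str:
    """Hash the normalised triple; five hyphen-joined groups of 8 hex chars."""
    line = " ".join(expand(t) for t in (s, p, o)) + " ."
    digest = hashlib.sha1(line.encode("utf-8")).hexdigest()  # 40 hex chars
    return "-".join(digest[i:i + 8] for i in range(0, 40, 8))

same_a = triple_id("<http://www.w3.org/2001/08/rdf-test/>",
                   "dc:creator", '"Dave Beckett"')
same_b = triple_id("http://www.w3.org/2001/08/rdf-test/",
                   "<http://purl.org/dc/elements/1.1/creator>", '"Dave Beckett"')
```

Here same_a and same_b come out equal because both spellings normalise to the same text before hashing. No lower-casing is applied, so literals that differ only in case get different identifiers.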

It looks like a UUID, but definitely isn't, because it doesn't have the stuff at the start of the hash or conform to any of the versions of the ISO standard.
You could add the start stuff if you want (at the cost of a little entropy, of course.)
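
If a well-formed UUID is wanted, one off-the-shelf option (swapped in here, not what the PHP above does) is a name-based version 5 UUID, which is also a SHA-1 hash but with the version and variant bits set. A small Python sketch; TRIPLE_NS and triple_uuid are hypothetical names, and the example.org namespace URL is a placeholder.

```python
import uuid

# Hypothetical namespace UUID for triple identifiers, derived from a
# placeholder URL; any fixed, agreed namespace would do.
TRIPLE_NS = uuid.uuid5(uuid.NAMESPACE_URL, "http://example.org/triple-ids")

def triple_uuid(line: str) -> uuid.UUID:
    """Name-based (version 5, SHA-1) UUID of the normalised triple text."""
    return uuid.uuid5(TRIPLE_NS, line)

line = ('<http://www.w3.org/2001/08/rdf-test/> '
        '<http://purl.org/dc/elements/1.1/creator> "Dave Beckett" .')
u = triple_uuid(line)
iri = u.urn  # "urn:uuid:..." form, usable as an IRI
```

The cost is as described: the result is 128 bits, of which 6 are fixed by the version and variant fields, so 122 bits of the hash survive instead of the full 160.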

It also doesn't have any of the time-stamp or MAC address salting of some UUID versions.
You aren't quite clear on that bit:
> - reproducible: the same triple must always get the same identifier
Do you mean if generated by a machine anywhere?
If so, then the straight hash is good - if not you want to add salt of a local machine parameter, such as MAC address.
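
The two cases can be sketched in Python as one function with an optional salt. The name salted_id and the way the salt is joined to the text are illustrative choices only. Note that uuid.getnode() can fall back to a random 48-bit number when no hardware address is found, so it is not a guaranteed stable machine parameter.

```python
import hashlib
import uuid

def salted_id(line: str, salt: str = "") -> str:
    """SHA-1 of the triple text, optionally prefixed with a local salt.

    With an empty salt every machine computes the same identifier.
    With a machine-specific salt the identifier is only reproducible
    on that machine.
    """
    return hashlib.sha1((salt + "\n" + line).encode("utf-8")).hexdigest()

def machine_salt() -> str:
    # MAC address as 12 hex digits; may be random if no MAC is available.
    return format(uuid.getnode(), "012x")

line = '<http://www.w3.org/2001/08/rdf-test/> <http://purl.org/dc/elements/1.1/creator> "Dave Beckett" .'
global_id = salted_id(line)                 # same anywhere
local_id = salted_id(line, machine_salt())  # tied to this machine
```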

For the domain(s):
I'm not sure what your requirements are.
Do you just want the identifiers to fuse for the same triples?
But not be able to find out anything about them?
That is, a one-way encoding?
That is, opaque IDs?
If so, you can choose anything you like, as long as it doesn't resolve to anything sensible.
A more interesting thing is to use the domain of the machine doing the ID generation, and then put the triple at the endpoint for the ID - so resolution of the ID will return the triple (in text, or whatever you fancy, maybe RDF?)
You can always make the resolution require authentication if you want the triple to stay opaque for normal use.
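
A toy Python sketch of the resolvable variant, with an in-memory dictionary standing in for the web endpoint. BASE, mint and resolve are hypothetical names, and example.org is a placeholder for the generating machine's domain. One thing to keep in mind: once the domain of the generating machine is part of the IRI, two machines will mint different IRIs for the same triple.

```python
import hashlib
from typing import Optional

BASE = "http://example.org/triple/"  # placeholder for the minting domain
_store = {}  # stands in for whatever the endpoint would serve

def mint(line: str) -> str:
    """Create the IRI for a triple and remember the triple under it."""
    iri = BASE + hashlib.sha1(line.encode("utf-8")).hexdigest()
    _store[iri] = line
    return iri

def resolve(iri: str) -> Optional[str]:
    """Return the triple text for a minted IRI, or None if unknown."""
    return _store.get(iri)

line = '<http://www.w3.org/2001/08/rdf-test/> <http://purl.org/dc/elements/1.1/creator> "Dave Beckett" .'
iri = mint(line)
```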


> On 24 Jul 2018, at 17:17, Hans Teijgeler <hans.teijgeler@quicknet.nl> wrote:
> Hi Thomas,
> We use UUIDs throughout, such as http://www.example.com/c00c7d47-a04a-46c5-9d81-623def3ff31c
> Look, for example, at https://www.uuidgenerator.net/
> Regards, Hans
> 15826,org
> On 24-7-2018 17:56, thomas lörtsch wrote:
>> I’m looking for an algorithm to generate identifiers for any possible RDF statements that are themselves IRIs, globally unique, reproducible and not dependent on the source of the statement (i.e. document URI).
>> In other words, the algorithm has to have the following properties:
>> - reproducible: the same triple must always get the same identifier
>> - global: the identifier must be independent from the source of the triple 
>>   (a document title, named graph name, graph URI etc)
>> - unique: triples that differ in subject, predicate or object must have different identifiers
>> - conformant: the identifier must be an IRI
>> - autonomous and finite: it can’t rely on any central repository, secondary service etc
>> Does such an algorithm exist? Is it even possible given that each node in the statement could be an IRI of maximal length and entropy? 
>> Thanks,
>> Thomas
